[RFC] IO scheduler based IO controller V9
From: Vivek Goyal @ 2009-08-28 21:30 UTC
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel


Hi All,

Here is V9 of the IO controller patches, generated on top of 2.6.31-rc7.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch

Changes from V8
===============
- Implemented bdi-like congestion semantics for io groups also. Now once an
  io group gets congested, we don't clear the congestion flag until the number
  of requests goes below nr_congestion_off.

  This helps in getting rid of the buffered write performance regression we
  were observing with the io controller patches.

  Gui, can you please test it and see if this version is better in terms
  of your buffered write tests?

- Moved some of the functions from blk-core.c to elevator-fq.c. This reduces
  the CONFIG_GROUP_IOSCHED ifdefs in blk-core.c and the code looks a little
  cleaner.

- Fixed an add_front issue: we now go left on the rb-tree if add_front is
  specified in the preemption case.

- Requeue the async ioq after one round of dispatch. This helps in emulating
  CFQ behavior.

- Pulled in v11 of the io tracking patches and modified the config option so
  that blkio is not compiled in if CONFIG_TRACK_ASYNC_CONTEXT is not enabled.

- Fixed some block tracepoints which were broken because of the per-group
  request list changes.

- Fixed some logging messages.

- Got rid of an extra call to update_prio as pointed out by Jerome and Gui.

- Merged the fix from Jerome for a crash while changing prio.

- Got rid of a redundant slice_start assignment as pointed out by Gui.

- Merged an elv_ioq_nr_dispatched() cleanup from Gui.

- Fixed a compilation issue if CONFIG_BLOCK=n.
 
What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other
resource controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups
and control the amount of disk time/bandwidth received by a group based on
its weight.
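
As a rough illustration of the intended usage, here is a minimal sketch,
assuming the group scheduling hierarchy is mounted at /cgroup/bfqio as in
the test scripts later in this mail (the group names here are made up for
the example):

-------------------------------------------------
# Create two groups with a 2:1 weight ratio.
mkdir /cgroup/bfqio/fast /cgroup/bfqio/slow
echo 1000 > /cgroup/bfqio/fast/io.weight
echo 500 > /cgroup/bfqio/slow/io.weight

# Move the current shell into a group. IO submitted by it (and its
# children) is then accounted to that group.
echo $$ > /cgroup/bfqio/fast/tasks
-------------------------------------------------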

How to solve the problem
=========================

Different people have solved the issue differently. At least there are now
three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group
and throttles the processes in the group if it exceeds the user-specified
limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device mapper
driver which provides fair access in terms of the amount of IO done (not in
terms of disk time as CFQ does).

So one sets up one or more dm-ioband devices on top of a physical/logical
block device, configures the ioband device and passes in information like
the grouping etc. This device then keeps track of the bios flowing through
it and controls the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here I have viewed the problem of the IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one can
view Linux IO schedulers as flat, where there is one root group and all the
IO belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I
have extended it to support group IO scheduling. I also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling
code can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some of the
thoughts.

- IO throttling is a max bandwidth controller and not a proportional one.
  Additionally it provides fairness in terms of the amount of IO done (and
  not in terms of disk time as CFQ does).

  Personally, I think that a proportional weight controller is useful to more
  people than just a max bandwidth controller. In addition, the IO scheduler
  based controller can also be enhanced to do max bandwidth control, if need
  be.

- dm-ioband also provides fairness in terms of the amount of IO done, not in
  terms of disk time. So a seeky process can still run away with a lot more
  disk time. It is an interesting question how fairness among groups should
  be viewed and what is more relevant: should fairness be based on the amount
  of IO done or on the amount of disk time consumed, as CFQ does? The IO
  scheduler based controller provides fairness in terms of disk time used.

- IO throttling and dm-ioband are both second-level controllers. That is,
  these controllers are implemented in higher layers than the io schedulers.
  So they control the IO at a higher layer based on group policies and the IO
  schedulers later take care of dispatching these bios to disk.

  Implementing a second-level controller has the advantage of being able to
  provide bandwidth control even on logical block devices in the IO stack
  which don't have any IO scheduler attached to them. But it can also
  interfere with the IO scheduling policy of the underlying IO scheduler and
  change the effective behavior. Following are some of the issues which I
  think would be visible in a second-level controller in one form or another.

  Prio with-in group
  ------------------
  A second-level controller can potentially interfere with the behavior of
  different prio processes with-in a group. bios are buffered at the higher
  layer in a single queue, and the release of bios is FIFO and not
  proportionate to the ioprio of the process. This can result in a particular
  prio level not getting its fair share.

  Buffering at the higher layer can delay read requests for more than the
  slice idle period of CFQ (default 8 ms). That means it is possible that we
  are waiting for a request from a queue but it is buffered at the higher
  layer, and then the idle timer fires. The queue loses its share, and at the
  same time overall throughput is impacted as we lost those 8 ms.
  
  Read Vs Write
  -------------
  Writes can overwhelm readers, hence a second-level controller's FIFO
  release will run into issues here. If a single queue is maintained then
  reads will suffer large latencies. If there are separate queues for reads
  and writes then it will be hard to decide in what ratio to dispatch reads
  and writes, as it is the IO scheduler's decision when and how much
  read/write to dispatch. This is another place where a higher-level
  controller will not be in sync with the lower-level io scheduler and can
  change the effective policies of the underlying io scheduler.

  Fairness in terms of disk time / size of IO
  ---------------------------------------------
  A higher-level controller will most likely be limited to providing fairness
  in terms of the size of IO done and will find it hard to provide fairness
  in terms of disk time used (as CFQ provides between various prio levels).
  This is because only the IO scheduler knows how much disk time a queue has
  used.

  I am not sure how useful it is to have fairness in terms of sectors, as CFQ
  has been providing fairness in terms of disk time. So a seeky application
  will still run away with a lot of disk time and bring down the overall
  throughput of the disk more than usual.

  CFQ IO context Issues
  ---------------------
  Buffering at the higher layer means bios are submitted later with the help
  of a worker thread. This changes the io context information at the CFQ
  layer, which assigns the request to the submitting thread. The change of io
  context info again leads to issues of idle timer expiry, a process not
  getting its fair share, and reduced throughput.

  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think a higher-level controller will result in reduced overall throughput
  (as compared to an io scheduler based io controller) and more seeks with
  noop, deadline and AS.

  The reason is that IO with-in a group is likely to be related and
  relatively close as compared to IO across groups. For example, the thread
  pool of kvm-qemu doing IO for a virtual machine. In the case of higher
  level control, IO from various groups will go into a single queue at the
  lower-level controller and it might happen that the IO is now interleaved
  (G1, G2, G1, G3, G4....), causing more seeks and reduced throughput.
  (Agreed that merging will help up to some extent, but still....).

  Instead, in the case of a lower-level controller, the IO scheduler
  maintains one queue per group, hence there is no interleaving of IO between
  groups. And if IO is related with-in a group, then we should get a reduced
  number of seeks and higher throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.

- The IO scheduler based controller has the limitation that it works only
  with the bottom-most devices in the IO stack where an IO scheduler is
  attached. The question then is how important/relevant it is to also control
  bandwidth at higher-level logical devices. The actual contention for
  resources is at the leaf block device, so it probably makes sense to do any
  kind of control there and not at the intermediate devices. Secondly, it
  probably also means better use of the available resources.

  For example, assume a user has created a linear logical device lv0 using
  three underlying disks sda, sdb and sdc. Also assume there are two tasks
  T1 and T2 in two groups doing IO on lv0, and that the weights of the groups
  are in the ratio 2:1, so T1 should get double the BW of T2 on the lv0
  device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			    sda sdb  sdc

  Now suppose IO control is done at the lv0 level, T1 is doing IO only to
  sda, and T2's IO is going to sdc. In this case there is no need for
  resource management, as the two IO streams have no contention where it
  matters. If we try to do IO control at the lv0 device, it will not be an
  optimal use of resources and will bring down overall throughput.

IMHO, an IO scheduler based IO controller is a reasonable approach to solve
the problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently. But I am all ears to alternative approaches and
suggestions on how things can be done better.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Open Issues
===========
- Currently for async requests like buffered writes, we get the io group
  information from the page instead of the task context. How important is it
  to determine the context from the page?

  Can we put all the pdflush threads into a separate group and control system
  wide buffered write bandwidth that way (roughly as sketched below)? Any
  buffered writes submitted by a process directly will go to the right group
  anyway.

  If that is acceptable then we can drop all the code associated with async
  io context and that should simplify the patchset a lot.
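
  A minimal sketch of that idea, assuming the writeback threads are still the
  2.6.31 "pdflush" kernel threads and the hierarchy is mounted at
  /cgroup/bfqio (the group name and weight are made up):

  -------------------------------------------------
  mkdir /cgroup/bfqio/pdflush
  echo 500 > /cgroup/bfqio/pdflush/io.weight
  # Move every pdflush thread into the group.
  for pid in $(pgrep pdflush); do
          echo $pid > /cgroup/bfqio/pdflush/tasks
  done
  -------------------------------------------------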

Testing
=======
I have divided the testing results into three sections.

- Latency
- Throughput and Fairness
- Group Fairness

Because I have enhanced CFQ to also do group scheduling, one of the concerns
has been that existing CFQ should not regress, at least in the flat setup. If
one creates groups and puts tasks in those, then that is a new environment
and some properties can change, because groups have the additional
requirement of providing isolation also.

Environment
===========
A 7200 RPM SATA drive with a queue depth of 31. Ext3 filesystem.
 
Latency Testing
++++++++++++++++

Test1: fsync-test with torture test from Linus as background writer
------------------------------------------------------------
I looked at the Ext3 fsync latency thread and picked fsync-test from Theodore
Ts'o and the torture test from Linus as the background writer to see how the
fsync completion latencies look. Following are the results.

Vanilla CFQ              IOC                    IOC (with map async)
===========             =================        ====================
fsync time: 0.2515      fsync time: 0.8580      fsync time: 0.0531
fsync time: 0.1082      fsync time: 0.1408      fsync time: 0.8907
fsync time: 0.2106      fsync time: 0.3228      fsync time: 0.2709
fsync time: 0.2591      fsync time: 0.0978      fsync time: 0.3198
fsync time: 0.2776      fsync time: 0.3035      fsync time: 0.0886
fsync time: 0.2530      fsync time: 0.0903      fsync time: 0.3035
fsync time: 0.2271      fsync time: 0.2712      fsync time: 0.0961
fsync time: 0.1057      fsync time: 0.3357      fsync time: 0.1048
fsync time: 0.1699      fsync time: 0.3175      fsync time: 0.2582
fsync time: 0.1923      fsync time: 0.2964      fsync time: 0.0876
fsync time: 0.1805      fsync time: 0.0971      fsync time: 0.2546
fsync time: 0.2944      fsync time: 0.2728      fsync time: 0.3059
fsync time: 0.1420      fsync time: 0.1079      fsync time: 0.2973
fsync time: 0.2650      fsync time: 0.3103      fsync time: 0.2032
fsync time: 0.1581      fsync time: 0.1987      fsync time: 0.2926
fsync time: 0.2656      fsync time: 0.3048      fsync time: 0.1934
fsync time: 0.2666      fsync time: 0.3092      fsync time: 0.2954
fsync time: 0.1272      fsync time: 0.0165      fsync time: 0.2952
fsync time: 0.2655      fsync time: 0.2827      fsync time: 0.2394
fsync time: 0.0147      fsync time: 0.0068      fsync time: 0.0454
fsync time: 0.2296      fsync time: 0.2923      fsync time: 0.2936
fsync time: 0.0069      fsync time: 0.3021      fsync time: 0.0397
fsync time: 0.2668      fsync time: 0.1032      fsync time: 0.2762
fsync time: 0.1932      fsync time: 0.0962      fsync time: 0.2946
fsync time: 0.1895      fsync time: 0.3545      fsync time: 0.0774
fsync time: 0.2577      fsync time: 0.2406      fsync time: 0.3027
fsync time: 0.4935      fsync time: 0.7193      fsync time: 0.2984
fsync time: 0.2804      fsync time: 0.3251      fsync time: 0.1057
fsync time: 0.2685      fsync time: 0.1001      fsync time: 0.3145
fsync time: 0.1946      fsync time: 0.2525      fsync time: 0.2992

IOC--> With IO controller patches applied. CONFIG_TRACK_ASYNC_CONTEXT=n
IOC(map async) --> IO controller patches with CONFIG_TRACK_ASYNC_CONTEXT=y

If CONFIG_TRACK_ASYNC_CONTEXT=y, async requests are mapped to the group based
on the cgroup info stored in the page, otherwise they are mapped to the
cgroup the submitting task belongs to.

Notes: 
- It looks like the max fsync time is a bit higher with the IO controller
  patches. Will dig more into it later.

Test2: read small files with multiple sequential readers (10) running
======================================================================
Took Ingo's small file reader test and ran it while 10 sequential readers
were running.

Vanilla CFQ     IOC (flat)      IOC (10 readers in 10 groups)
0.12 seconds    0.11 seconds    1.62 seconds
0.05 seconds    0.05 seconds    1.18 seconds
0.05 seconds    0.05 seconds    1.17 seconds
0.03 seconds    0.04 seconds    1.18 seconds
1.15 seconds    1.17 seconds    1.29 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.18 seconds    1.18 seconds
1.15 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
0.04 seconds    0.04 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.18 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.18 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.15 seconds    1.17 seconds
1.15 seconds    1.15 seconds    1.18 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds

In the third column, the 10 readers have been put into 10 groups instead of
running in the root group. The small file reader runs in the root group.

Notes: It looks like the read latencies here remain the same as with vanilla
CFQ.

Test3: read small files with multiple writers (8) running
==========================================================
Again running the small file reader test, this time with 8 buffered writers
running with prio 0 to 7.

Latency results are in seconds. I tried to capture the output with multiple
configurations of the IO controller to see the effect.

Vanilla  IOC     IOC     IOC     IOC    IOC     IOC
        (flat)(groups) (groups) (map)  (map)    (map)
                (f=0)   (f=1)   (flat) (groups) (groups)
                                        (f=0)   (f=1)
0.25    0.03    0.31    0.25    0.29    1.25    0.39
0.27    0.28    0.28    0.30    0.41    0.90    0.80
0.25    0.24    0.23    0.37    0.27    1.17    0.24
0.14    0.14    0.14    0.13    0.15    0.10    1.11
0.14    0.16    0.13    0.16    0.15    0.06    0.58
0.16    0.11    0.15    0.12    0.19    0.05    0.14
0.03    0.17    0.12    0.17    0.04    0.12    0.12
0.13    0.13    0.13    0.14    0.03    0.05    0.05
0.18    0.13    0.17    0.09    0.09    0.05    0.07
0.11    0.18    0.16    0.18    0.14    0.05    0.12
0.28    0.14    0.15    0.15    0.13    0.02    0.04
0.16    0.14    0.14    0.12    0.15    0.00    0.13
0.14    0.13    0.14    0.13    0.13    0.02    0.02
0.13    0.11    0.12    0.14    0.15    0.06    0.01
0.27    0.28    0.32    0.24    0.25    0.01    0.01
0.14    0.15    0.18    0.15    0.13    0.06    0.02
0.15    0.13    0.13    0.13    0.13    0.00    0.04
0.15    0.13    0.15    0.14    0.15    0.01    0.05
0.11    0.17    0.15    0.13    0.13    0.02    0.00
0.17    0.13    0.17    0.12    0.18    0.39    0.01
0.18    0.16    0.14    0.16    0.14    0.89    0.47
0.13    0.13    0.14    0.04    0.12    0.64    0.78
0.16    0.15    0.19    0.11    0.16    0.67    1.17
0.04    0.12    0.14    0.04    0.18    0.67    0.63
0.03    0.13    0.17    0.11    0.15    0.61    0.69
0.15    0.16    0.13    0.14    0.13    0.77    0.66
0.12    0.12    0.15    0.11    0.13    0.92    0.73
0.15    0.12    0.15    0.16    0.13    0.70    0.73
0.11    0.13    0.15    0.10    0.18    0.73    0.82
0.16    0.19    0.15    0.16    0.14    0.71    0.74
0.28    0.05    0.26    0.22    0.17    2.91    0.79
0.13    0.05    0.14    0.14    0.14    0.44    0.65
0.16    0.22    0.18    0.13    0.26    0.31    0.65
0.10    0.13    0.12    0.11    0.16    0.25    0.66
0.13    0.14    0.16    0.15    0.12    0.17    0.76
0.19    0.11    0.12    0.14    0.17    0.20    0.71
0.16    0.15    0.14    0.15    0.11    0.19    0.68
0.13    0.13    0.13    0.13    0.16    0.04    0.78
0.14    0.16    0.15    0.17    0.15    1.20    0.80
0.17    0.13    0.14    0.18    0.14    0.76    0.63

f(0/1)--> refers to the "fairness" tunable. This is a new tunable that is
	  part of CFQ. If set, we wait for requests from one queue to finish
	  before a new queue is scheduled in.

group ---> writers are running in individual groups and not in the root group.
map---> buffered writes are mapped to groups using the info stored in the page.
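
For reference, the f=1 columns correspond to turning the tunable on before
the run, the same way the test scripts later in this mail do:

  echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness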

Notes: Except in the case of columns 6 and 7, where writers are in separate
groups and we are mapping their writes to their respective groups, latencies
seem to be fine. I think the latencies are higher in the last two cases
because the reader can no longer preempt the writer.

				root
			       / \  \ \
			      R  G1 G2 G3
				 |  |  |
				 W  W  W
Test4: Random Reader test in presence of 4 sequential readers and 4 buffered
       writers
============================================================================
Used fio this time to run one random reader and see how it fares in the
presence of 4 sequential readers and 4 writers.

I have just pasted the output of the random reader from fio.

Vanilla Kernel, Three runs
--------------------------
read : io=20,512KiB, bw=349KiB/s, iops=10, runt= 60075msec
clat (usec): min=944, max=2,675K, avg=93715.04, stdev=305815.90

read : io=13,696KiB, bw=233KiB/s, iops=7, runt= 60035msec
clat (msec): min=2, max=1,812, avg=140.26, stdev=382.55

read : io=13,824KiB, bw=235KiB/s, iops=7, runt= 60185msec
clat (usec): min=766, max=2,025K, avg=139310.55, stdev=383647.54

IO controller kernel, Three runs
--------------------------------
read : io=10,304KiB, bw=175KiB/s, iops=5, runt= 60083msec
clat (msec): min=2, max=2,654, avg=186.59, stdev=524.08

read : io=10,176KiB, bw=173KiB/s, iops=5, runt= 60054msec
clat (usec): min=792, max=2,567K, avg=188841.70, stdev=517154.75

read : io=11,040KiB, bw=188KiB/s, iops=5, runt= 60003msec
clat (usec): min=779, max=2,625K, avg=173915.56, stdev=508118.60

Notes:
- Looks like vanilla CFQ gives a bit more disk access to the random reader.
  Will dig into it.

Throughput and Fairness
+++++++++++++++++++++++
Test5: Bandwidth distribution between 4 sequential readers and 4 buffered
       writers
==========================================================================
Used fio to launch 4 sequential readers and 4 buffered writers and watched
how BW is distributed.

Vanilla kernel, Three sets
--------------------------
read : io=962MiB, bw=16,818KiB/s, iops=513, runt= 60008msec
read : io=969MiB, bw=16,920KiB/s, iops=516, runt= 60077msec
read : io=978MiB, bw=17,063KiB/s, iops=520, runt= 60096msec
read : io=922MiB, bw=16,106KiB/s, iops=491, runt= 60057msec
write: io=235MiB, bw=4,099KiB/s, iops=125, runt= 60049msec
write: io=226MiB, bw=3,944KiB/s, iops=120, runt= 60049msec
write: io=215MiB, bw=3,747KiB/s, iops=114, runt= 60049msec
write: io=207MiB, bw=3,606KiB/s, iops=110, runt= 60049msec
READ: io=3,832MiB, aggrb=66,868KiB/s, minb=16,106KiB/s, maxb=17,063KiB/s,
mint=60008msec, maxt=60096msec
WRITE: io=882MiB, aggrb=15,398KiB/s, minb=3,606KiB/s, maxb=4,099KiB/s,
mint=60049msec, maxt=60049msec

read : io=1,002MiB, bw=17,513KiB/s, iops=534, runt= 60020msec
read : io=979MiB, bw=17,085KiB/s, iops=521, runt= 60080msec
read : io=953MiB, bw=16,637KiB/s, iops=507, runt= 60092msec
read : io=920MiB, bw=16,057KiB/s, iops=490, runt= 60108msec
write: io=215MiB, bw=3,560KiB/s, iops=108, runt= 63289msec
write: io=136MiB, bw=2,361KiB/s, iops=72, runt= 60502msec
write: io=127MiB, bw=2,101KiB/s, iops=64, runt= 63289msec
write: io=233MiB, bw=3,852KiB/s, iops=117, runt= 63289msec
READ: io=3,855MiB, aggrb=67,256KiB/s, minb=16,057KiB/s, maxb=17,513KiB/s,
mint=60020msec, maxt=60108msec
WRITE: io=711MiB, aggrb=11,771KiB/s, minb=2,101KiB/s, maxb=3,852KiB/s,
mint=60502msec, maxt=63289msec

read : io=985MiB, bw=17,179KiB/s, iops=524, runt= 60149msec
read : io=974MiB, bw=17,025KiB/s, iops=519, runt= 60002msec
read : io=962MiB, bw=16,772KiB/s, iops=511, runt= 60170msec
read : io=932MiB, bw=16,280KiB/s, iops=496, runt= 60057msec
write: io=177MiB, bw=2,933KiB/s, iops=89, runt= 63094msec
write: io=152MiB, bw=2,637KiB/s, iops=80, runt= 60323msec
write: io=240MiB, bw=3,983KiB/s, iops=121, runt= 63094msec
write: io=147MiB, bw=2,439KiB/s, iops=74, runt= 63094msec
READ: io=3,855MiB, aggrb=67,174KiB/s, minb=16,280KiB/s, maxb=17,179KiB/s,
mint=60002msec, maxt=60170msec
WRITE: io=715MiB, aggrb=11,877KiB/s, minb=2,439KiB/s, maxb=3,983KiB/s,
mint=60323msec, maxt=63094msec

IO controller kernel, Three sets
-------------------------------
read : io=944MiB, bw=16,483KiB/s, iops=503, runt= 60055msec
read : io=941MiB, bw=16,433KiB/s, iops=501, runt= 60073msec
read : io=900MiB, bw=15,713KiB/s, iops=479, runt= 60040msec
read : io=866MiB, bw=15,112KiB/s, iops=461, runt= 60086msec
write: io=244MiB, bw=4,262KiB/s, iops=130, runt= 60040msec
write: io=177MiB, bw=3,085KiB/s, iops=94, runt= 60042msec
write: io=158MiB, bw=2,758KiB/s, iops=84, runt= 60041msec
write: io=180MiB, bw=3,137KiB/s, iops=95, runt= 60040msec
READ: io=3,651MiB, aggrb=63,718KiB/s, minb=15,112KiB/s, maxb=16,483KiB/s,
mint=60040msec, maxt=60086msec
WRITE: io=758MiB, aggrb=13,243KiB/s, minb=2,758KiB/s, maxb=4,262KiB/s,
mint=60040msec, maxt=60042msec

read : io=960MiB, bw=16,734KiB/s, iops=510, runt= 60137msec
read : io=917MiB, bw=16,001KiB/s, iops=488, runt= 60122msec
read : io=897MiB, bw=15,683KiB/s, iops=478, runt= 60004msec
read : io=908MiB, bw=15,824KiB/s, iops=482, runt= 60149msec
write: io=209MiB, bw=3,563KiB/s, iops=108, runt= 61400msec
write: io=177MiB, bw=3,030KiB/s, iops=92, runt= 61400msec
write: io=200MiB, bw=3,409KiB/s, iops=104, runt= 61400msec
write: io=204MiB, bw=3,489KiB/s, iops=106, runt= 61400msec
READ: io=3,682MiB, aggrb=64,194KiB/s, minb=15,683KiB/s, maxb=16,734KiB/s,
mint=60004msec, maxt=60149msec
WRITE: io=790MiB, aggrb=13,492KiB/s, minb=3,030KiB/s, maxb=3,563KiB/s,
mint=61400msec, maxt=61400msec

read : io=968MiB, bw=16,867KiB/s, iops=514, runt= 60158msec
read : io=925MiB, bw=16,135KiB/s, iops=492, runt= 60142msec
read : io=875MiB, bw=15,286KiB/s, iops=466, runt= 60003msec
read : io=872MiB, bw=15,221KiB/s, iops=464, runt= 60049msec
write: io=213MiB, bw=3,720KiB/s, iops=113, runt= 60162msec
write: io=203MiB, bw=3,536KiB/s, iops=107, runt= 60163msec
write: io=208MiB, bw=3,620KiB/s, iops=110, runt= 60162msec
write: io=203MiB, bw=3,538KiB/s, iops=107, runt= 60163msec
READ: io=3,640MiB, aggrb=63,439KiB/s, minb=15,221KiB/s, maxb=16,867KiB/s,
mint=60003msec, maxt=60158msec
WRITE: io=827MiB, aggrb=14,415KiB/s, minb=3,536KiB/s, maxb=3,720KiB/s,
mint=60162msec, maxt=60163msec

Notes: It looks like vanilla CFQ favors readers over writers a bit more than
       the IO controller CFQ does. Will dig into it.

Test6: Bandwidth distribution between readers of diff prio
==========================================================
Using fio, ran 8 readers of prio 0 to 7, let them run for 30 seconds and
watched the overall throughput and who got how much IO done.

Vanilla kernel, Three sets
---------------------------
read : io=454MiB, bw=15,865KiB/s, iops=484, runt= 30004msec
read : io=382MiB, bw=13,330KiB/s, iops=406, runt= 30086msec
read : io=325MiB, bw=11,330KiB/s, iops=345, runt= 30074msec
read : io=294MiB, bw=10,253KiB/s, iops=312, runt= 30062msec
read : io=238MiB, bw=8,321KiB/s, iops=253, runt= 30048msec
read : io=145MiB, bw=5,061KiB/s, iops=154, runt= 30032msec
read : io=99MiB, bw=3,456KiB/s, iops=105, runt= 30021msec
read : io=67,040KiB, bw=2,280KiB/s, iops=69, runt= 30108msec
READ: io=2,003MiB, aggrb=69,767KiB/s, minb=2,280KiB/s, maxb=15,865KiB/s,
mint=30004msec, maxt=30108msec

read : io=450MiB, bw=15,727KiB/s, iops=479, runt= 30001msec
read : io=371MiB, bw=12,966KiB/s, iops=395, runt= 30040msec
read : io=325MiB, bw=11,321KiB/s, iops=345, runt= 30099msec
read : io=296MiB, bw=10,332KiB/s, iops=315, runt= 30086msec
read : io=238MiB, bw=8,319KiB/s, iops=253, runt= 30056msec
read : io=152MiB, bw=5,290KiB/s, iops=161, runt= 30070msec
read : io=100MiB, bw=3,483KiB/s, iops=106, runt= 30020msec
read : io=68,832KiB, bw=2,340KiB/s, iops=71, runt= 30118msec
READ: io=2,000MiB, aggrb=69,631KiB/s, minb=2,340KiB/s, maxb=15,727KiB/s,
mint=30001msec, maxt=30118msec

read : io=450MiB, bw=15,691KiB/s, iops=478, runt= 30068msec
read : io=369MiB, bw=12,882KiB/s, iops=393, runt= 30032msec
read : io=364MiB, bw=12,732KiB/s, iops=388, runt= 30015msec
read : io=283MiB, bw=9,889KiB/s, iops=301, runt= 30002msec
read : io=228MiB, bw=7,935KiB/s, iops=242, runt= 30091msec
read : io=144MiB, bw=5,018KiB/s, iops=153, runt= 30103msec
read : io=97,760KiB, bw=3,327KiB/s, iops=101, runt= 30083msec
read : io=66,784KiB, bw=2,276KiB/s, iops=69, runt= 30046msec
READ: io=1,999MiB, aggrb=69,625KiB/s, minb=2,276KiB/s, maxb=15,691KiB/s,
mint=30002msec, maxt=30103msec

IO controller kernel, Three sets
--------------------------------
read : io=404MiB, bw=14,103KiB/s, iops=430, runt= 30072msec
read : io=344MiB, bw=11,999KiB/s, iops=366, runt= 30035msec
read : io=294MiB, bw=10,257KiB/s, iops=313, runt= 30052msec
read : io=254MiB, bw=8,888KiB/s, iops=271, runt= 30021msec
read : io=238MiB, bw=8,311KiB/s, iops=253, runt= 30086msec
read : io=177MiB, bw=6,202KiB/s, iops=189, runt= 30001msec
read : io=158MiB, bw=5,517KiB/s, iops=168, runt= 30118msec
read : io=99MiB, bw=3,464KiB/s, iops=105, runt= 30107msec
READ: io=1,971MiB, aggrb=68,604KiB/s, minb=3,464KiB/s, maxb=14,103KiB/s,
mint=30001msec, maxt=30118msec

read : io=375MiB, bw=13,066KiB/s, iops=398, runt= 30110msec
read : io=326MiB, bw=11,409KiB/s, iops=348, runt= 30003msec
read : io=308MiB, bw=10,758KiB/s, iops=328, runt= 30066msec
read : io=256MiB, bw=8,937KiB/s, iops=272, runt= 30091msec
read : io=232MiB, bw=8,088KiB/s, iops=246, runt= 30041msec
read : io=192MiB, bw=6,695KiB/s, iops=204, runt= 30077msec
read : io=144MiB, bw=5,014KiB/s, iops=153, runt= 30051msec
read : io=96,224KiB, bw=3,281KiB/s, iops=100, runt= 30026msec
READ: io=1,928MiB, aggrb=67,145KiB/s, minb=3,281KiB/s, maxb=13,066KiB/s,
mint=30003msec, maxt=30110msec

read : io=405MiB, bw=14,162KiB/s, iops=432, runt= 30021msec
read : io=354MiB, bw=12,386KiB/s, iops=378, runt= 30007msec
read : io=303MiB, bw=10,567KiB/s, iops=322, runt= 30062msec
read : io=261MiB, bw=9,126KiB/s, iops=278, runt= 30040msec
read : io=228MiB, bw=7,946KiB/s, iops=242, runt= 30048msec
read : io=178MiB, bw=6,222KiB/s, iops=189, runt= 30074msec
read : io=152MiB, bw=5,286KiB/s, iops=161, runt= 30093msec
read : io=99MiB, bw=3,446KiB/s, iops=105, runt= 30110msec
READ: io=1,981MiB, aggrb=68,996KiB/s, minb=3,446KiB/s, maxb=14,162KiB/s,
mint=30007msec, maxt=30110msec

Notes:
- It looks like overall throughput is 1-3% less in the case of the io
  controller.
- Bandwidth distribution between the various prio levels has changed a bit.
  CFQ seems to use a 100ms slice length for prio 4, and this slice grows by
  20% for each higher prio level and shrinks by 20% for each lower prio
  level. So the IO controller does not seem to be doing too badly in meeting
  that distribution.

Group Fairness
+++++++++++++++
Test7 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two
virtual machines into two different cgroups with weights 1000 and 500
respectively. The virtual machines created ext3 file systems on the
partitions exported from the host and did buffered writes. The host sees
these writes as synchronous, and the virtual machine with the higher weight
gets double the disk time of the virtual machine with the lower weight. Used
the deadline scheduler in this test case.

Some more details about the configuration are in the documentation patch.
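
For reference, the cgroup side of this setup looks roughly like the sketch
below. The /cgroup/bfqio hierarchy and the io.weight/tasks files are the same
as in the other tests in this mail; the group names are made up and the guest
launch itself is omitted.

-------------------------------------------------
mkdir /cgroup/bfqio/vm1 /cgroup/bfqio/vm2
echo 1000 > /cgroup/bfqio/vm1/io.weight
echo 500 > /cgroup/bfqio/vm2/io.weight
echo deadline > /sys/block/$BLOCKDEV/queue/scheduler

# Start each guest from a shell that has been moved into its group, so
# that qemu-kvm and its worker threads are accounted to that group.
echo $$ > /cgroup/bfqio/vm1/tasks
# ... launch the first qemu-kvm guest here ...

echo $$ > /cgroup/bfqio/vm2/tasks
# ... launch the second qemu-kvm guest here ...
-------------------------------------------------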

Test8 (Fairness for synchronous reads)
======================================
- Two dd's in two cgroups with cgroup weights 1000 and 500. Ran the two "dd"
  readers in those cgroups (with the CFQ scheduler and
  /sys/block/<device>/queue/iosched/fairness = 1).

  The higher-weight dd finishes first, and at that point my script reads the
  cgroup files io.disk_time and io.disk_sectors for both groups and displays
  the results.

  dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
  dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

  group1 time=8:16 2452 group1 sectors=8:16 457856
  group2 time=8:16 1317 group2 sectors=8:16 247008

  234179072 bytes (234 MB) copied, 3.90912 s, 59.9 MB/s
  234179072 bytes (234 MB) copied, 5.15548 s, 45.4 MB/s

The first two fields in the time and sectors statistics represent the major
and minor number of the device. The third field represents the disk time in
milliseconds and the number of sectors transferred, respectively.

This patchset tries to provide fairness in terms of disk time received.
group1 got almost double the disk time of group2 (at the time the first dd
finished). These time and sectors statistics can be read using the
io.disk_time and io.disk_sectors files in the cgroup. More about it in the
documentation file.
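
For completeness, the setup around those two dd commands looks roughly like
the sketch below (the group names, weights and the io.disk_time /
io.disk_sectors files are the ones used throughout this mail; the rest is
illustrative):

-------------------------------------------------
mkdir /cgroup/bfqio/test1 /cgroup/bfqio/test2
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

echo cfq > /sys/block/$BLOCKDEV/queue/scheduler
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
pid1=$!

echo $$ > /cgroup/bfqio/test2/tasks
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

# The higher-weight dd finishes first; read the group stats at that point.
wait $pid1
echo group1 time=$(cat /cgroup/bfqio/test1/io.disk_time) \
     group1 sectors=$(cat /cgroup/bfqio/test1/io.disk_sectors)
echo group2 time=$(cat /cgroup/bfqio/test2/io.disk_time) \
     group2 sectors=$(cat /cgroup/bfqio/test2/io.disk_sectors)
-------------------------------------------------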

Test9 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. The IO controller can provide isolation between readers
and buffered (async) writers.

First I ran the test without the io controller to see the severity of the
issue. Ran a hostile writer and then, after 10 seconds, started a reader and
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
	conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test whether the io controller can provide isolation
between readers and writers with noop. I created two cgroups of weight 1000
each, put the reader in group1 and the writer in group2, and ran the test
again. Upon completion of the reader, my script reads the io.disk_time and
io.disk_sectors cgroup files to get an estimate of how much disk time each
group got and how many sectors each group did IO for.

For more accurate accounting of disk time for buffered writes with queuing
hardware, I had to set /sys/block/<disk>/queue/iosched/fairness to "1".

sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo  1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------

Results
=======
268435456 bytes (268 MB) copied, 6.92248 s, 38.8 MB/s

group1 time=8:16 3185 group1 sectors=8:16 524824
group2 time=8:16 3190 group2 sectors=8:16 503848

Note that the reader now finishes in much less time, and group1 and group2
each got almost 3 seconds of disk time. Hence the io controller provides the
reader isolation from buffered writes.

Test10 (AIO)
===========

AIO reads
-----------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
    --output=/mnt/$BLOCKDEV/fio1/test1.log \
    --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
    --output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got until the first fio job finished.

Results
------

test1 statistics: time=8:16 17955   sectors=8:16 1049656 dq=8:16 2
test2 statistics: time=8:16 9217   sectors=8:16 602592 dq=8:16 1

The above shows that by the time the first fio job (higher weight) finished,
group test1 had got 17955 ms of disk time and group test2 had got 9217 ms of
disk time. The statistics for the number of sectors transferred are shown
similarly.

Note that the disk time given to group test1 is almost double that of group
test2.

AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
    --output=/mnt/$BLOCKDEV/fio1/test1.log \
    --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
    --output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got until the first fio job finished.

Following are the results.

test1 statistics: time=8:16 25452   sectors=8:16 1049664 dq=8:16 2
test2 statistics: time=8:16 12939   sectors=8:16 532184 dq=8:16 4

The above shows that by the time the first fio job (higher weight) finished,
group test1 had got almost double the disk time of group test2.

Test11 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache), and possibly in the file
system layer as well (btrfs, xfs etc.), and are dispatched to lower layers
not necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing out huge files. Very soon we will cross vm_dirty_ratio, and a dd
thread will be forced to write out some pages to disk before more pages can
be dirtied. But it is not necessarily the dirty pages of the same thread that
are picked. It can very well pick the inode of the lower-priority dd thread
and do some writeout. So effectively the higher-weight dd is doing writeouts
of the lower-weight dd's pages and we don't see service differentiation.

IOW, the core problem with buffered write fairness is that the higher-weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. In my testing, there are many 0.2 to 0.8
second intervals where the higher-weight queue is empty, and in that duration
the lower-weight queue gets a lot of work done, giving the impression that
there was no service differentiation.

In summary, from the IO controller point of view, async write support is
there. But because the page cache has not been designed in such a way that a
higher prio/weight writer can do more writeout than a lower prio/weight
writer, getting service differentiation is hard, and it is visible in some
cases and not in others.
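
For reference, the kind of test referred to above is simply two buffered dd
writers in the two groups. A sketch, reusing the group setup and file naming
of the earlier tests (no conv=fdatasync, so the writes stay buffered):

-------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

echo $$ > /cgroup/bfqio/test1/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/zerofile1 bs=4K count=2097152 &

echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/zerofile2 bs=4K count=2097152 &
-------------------------------------------------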

Previous versions of the patches were posted here.
------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [RFC] IO scheduler based IO controller V9
@ 2009-08-28 21:30 ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds


Hi All,

Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch

Changes from V8
===============
- Implemented bdi like congestion semantics for io group also. Now once an
  io group gets congested, we don't clear the congestion flag until number
  of requests goes below nr_congestion_off.

  This helps in getting rid of Buffered write performance regression we
  were observing with io controller patches.

  Gui, can you please test it and see if this version is better in terms
  of your buffered write tests.

- Moved some of the functions from blk-core.c to elevator-fq.c. This reduces
  CONFIG_GROUP_IOSCHED ifdefs in blk-core.c and code looks little more clean. 

- Fixed issue of add_front where we go left on rb-tree if add_front is
  specified in case of preemption.

- Requeue async ioq after one round of dispatch. This helps emulationg
  CFQ behavior.

- Pulled in v11 of io tracking patches and modified config option so that if
  CONFIG_TRACK_ASYNC_CONTEXT is not enabled, blkio is not compiled in.

- Fixed some block tracepoints which were broken because of per group request
  list changes.

- Fixed some logging messages.

- Got rid of extra call to update_prio as pointed out by Jerome and Gui.

- Merged the fix from jerome for a crash while chaning prio.

- Got rid of redundant slice_start assignment as pointed by Gui.

- Merged a elv_ioq_nr_dispatched() cleanup from Gui.

- Fixed a compilation issue if CONFIG_BLOCK=n.
 
What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight. 

How to solve the problem
=========================

Different people have solved the issue differetnly. At least there are now
three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of IO rate of a group and
throttles the process in the group if it exceeds the user specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).

So one will setup one or more dm-ioband devices on top of physical/logical
block device, configure the ioband device and pass information like grouping
etc. Now this device will keep track of bios flowing through it and control
the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here I have viewed the problem of IO contoller as hierarchical group scheduling (along the lines of CFS group scheduling) issue. Currently one can view linux
IO schedulers as flat where there is one root group and all the IO belongs to
that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I 
have extended it support group IO schduling. Also took some of the code out
of CFQ and put in a common layer so that same group scheduling code can be
used by noop, deadline and AS to support group scheduling. 

Pros/Cons
=========
There are pros and cons to each of the approach. Following are some of the
thoughts.

- IO throttling is a max bandwidth controller and not a proportional one.
  Additionaly it provides fairness in terms of amount of IO done (and not in
  terms of disk time as CFQ does).

  Personally, I think that proportional weight controller is useful to more
  people than just max bandwidth controller. In addition, IO scheduler based
  controller can also be enhanced to do max bandwidth control, if need be.

- dm-ioband also provides fairness in terms of amount of IO done not in terms
  of disk time. So a seeky process can still run away with lot more disk time.
  Now this is an interesting question that how fairness among groups should be
  viewed and what is more relevant. Should fairness be based on amount of IO
  done or amount of disk time consumed as CFQ does. IO scheduler based
  controller provides fairness in terms of disk time used.

- IO throttling and dm-ioband both are second level controller. That is these
  controllers are implemented in higher layers than io schedulers. So they
  control the IO at higher layer based on group policies and later IO
  schedulers take care of dispatching these bios to disk.

  Implementing a second level controller has the advantage of being able to
  provide bandwidth control even on logical block devices in the IO stack
  which don't have any IO schedulers attached to these. But they can also 
  interefere with IO scheduling policy of underlying IO scheduler and change
  the effective behavior. Following are some of the issues which I think
  should be visible in second level controller in one form or other.

  Prio with-in group
  ------------------
  A second level controller can potentially interefere with behavior of
  different prio processes with-in a group. bios are buffered at higher layer
  in single queue and release of bios is FIFO and not proportionate to the
  ioprio of the process. This can result in a particular prio level not
  getting fair share.

  Buffering at higher layer can delay read requests for more than slice idle
  period of CFQ (default 8 ms). That means, it is possible that we are waiting
  for a request from the queue but it is buffered at higher layer and then idle
  timer will fire. It means that queue will losse its share at the same time
  overall throughput will be impacted as we lost those 8 ms.
  
  Read Vs Write
  -------------
  Writes can overwhelm readers hence second level controller FIFO release
  will run into issue here. If there is a single queue maintained then reads
  will suffer large latencies. If there separate queues for reads and writes
  then it will be hard to decide in what ratio to dispatch reads and writes as
  it is IO scheduler's decision to decide when and how much read/write to
  dispatch. This is another place where higher level controller will not be in
  sync with lower level io scheduler and can change the effective policies of
  underlying io scheduler.

  Fairness in terms of disk time / size of IO
  ---------------------------------------------
  An higher level controller will most likely be limited to providing fairness
  in terms of size of IO done and will find it hard to provide fairness in
  terms of disk time used (as CFQ provides between various prio levels). This
  is because only IO scheduler knows how much disk time a queue has used.

  Not sure how useful it is to have fairness in terms of secotrs as CFQ has
  been providing fairness in terms of disk time. So a seeky application will
  still run away with lot of disk time and bring down the overall throughput
  of the the disk more than usual.

  CFQ IO context Issues
  ---------------------
  Buffering at higher layer means submission of bios later with the help of
  a worker thread. This changes the io context information at CFQ layer which
  assigns the request to submitting thread. Change of io context info again
  leads to issues of idle timer expiry and issue of a process not getting fair
  share and reduced throughput.

  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think an higher level controller will result in reduced overall throughput
  (as compared to io scheduler based io controller) and more seeks with noop,
  deadline and AS.

  The reason being, that it is likely that IO with-in a group will be related
  and will be relatively close as compared to IO across the groups. For example,
  thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
  control, IO from various groups will go into a single queue at lower level
  controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
  G4....) causing more seeks and reduced throughput. (Agreed that merging will
  help up to some extent but still....).

  Instead, in case of lower level controller, IO scheduler maintains one queue
  per group hence there is no interleaving of IO between groups. And if IO is
  related with-in group, then we shoud get reduced number/amount of seek and
  higher throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.

- IO scheduler based controller has the limitation that it works only with the
  bottom most devices in the IO stack where IO scheduler is attached. Now the
  question comes that how important/relevant it is to control bandwidth at
  higher level logical devices also. The actual contention for resources is
  at the leaf block device so it probably makes sense to do any kind of
  control there and not at the intermediate devices. Secondly probably it
  also means better use of available resources.

  For example, assume a user has created a linear logical device lv0 using
  three underlying disks sda, sdb and sdc. Also assume there are two tasks
  T1 and T2 in two groups doing IO on lv0. Also assume that weights of groups
  are in the ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			    sda sdb  sdc

  Now if IO control is done at lv0 level, then if T1 is doing IO to only sda,
  and T2's IO is going to sdc. In this case there is no need of resource
  management as both the IOs don't have any contention where it matters. If we
  try to do IO control at lv0 device, it will not be an optimal usage of
  resources and will bring down overall throughput.

IMHO, IO scheduler based IO controller is a reasonable approach to solve the
problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently. But I am all ears to alternative approaches and
suggestions how doing things can be done better.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Open Issues
===========
- Currently for async requests like buffered writes, we get the io group
  information from the page instead of the task context. How important it is
  to determine the context from page?

  Can we put all the pdflush threads into a separate group and control system
  wide buffered write bandwidth. Any buffered writes submitted by the process
  directly will any way go to right group.

  If it is acceptable then we can drop all the code associated with async io
  context and that should simplify the patchset a lot.  

Testing
=======
I have divided testing results in three sections. 

- Latency
- Throughput and Fairness
- Group Fairness

Because I have enhanced CFQ to also do group scheduling, one of the concerns
has been that existing CFQ should not regress at least in flat setup. If
one creates groups and puts tasks in those, then this is new environment and
some properties can change because groups have this additional requirement
of providing isolation also.

Environment
==========
A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
 
Latency Testing
++++++++++++++++

Test1: fsync-test with torture test from linus as background writer
------------------------------------------------------------
I looked at Ext3 fsync latency thread and picked fsync-test from Theodore Ts'o
and torture test from Linus as background writer to see how are the fsync
completion latencies. Following are the results.

Vanilla CFQ              IOC                    IOC (with map async)
===========             =================        ====================
fsync time: 0.2515      fsync time: 0.8580      fsync time: 0.0531
fsync time: 0.1082      fsync time: 0.1408      fsync time: 0.8907
fsync time: 0.2106      fsync time: 0.3228      fsync time: 0.2709
fsync time: 0.2591      fsync time: 0.0978      fsync time: 0.3198
fsync time: 0.2776      fsync time: 0.3035      fsync time: 0.0886
fsync time: 0.2530      fsync time: 0.0903      fsync time: 0.3035
fsync time: 0.2271      fsync time: 0.2712      fsync time: 0.0961
fsync time: 0.1057      fsync time: 0.3357      fsync time: 0.1048
fsync time: 0.1699      fsync time: 0.3175      fsync time: 0.2582
fsync time: 0.1923      fsync time: 0.2964      fsync time: 0.0876
fsync time: 0.1805      fsync time: 0.0971      fsync time: 0.2546
fsync time: 0.2944      fsync time: 0.2728      fsync time: 0.3059
fsync time: 0.1420      fsync time: 0.1079      fsync time: 0.2973
fsync time: 0.2650      fsync time: 0.3103      fsync time: 0.2032
fsync time: 0.1581      fsync time: 0.1987      fsync time: 0.2926
fsync time: 0.2656      fsync time: 0.3048      fsync time: 0.1934
fsync time: 0.2666      fsync time: 0.3092      fsync time: 0.2954
fsync time: 0.1272      fsync time: 0.0165      fsync time: 0.2952
fsync time: 0.2655      fsync time: 0.2827      fsync time: 0.2394
fsync time: 0.0147      fsync time: 0.0068      fsync time: 0.0454
fsync time: 0.2296      fsync time: 0.2923      fsync time: 0.2936
fsync time: 0.0069      fsync time: 0.3021      fsync time: 0.0397
fsync time: 0.2668      fsync time: 0.1032      fsync time: 0.2762
fsync time: 0.1932      fsync time: 0.0962      fsync time: 0.2946
fsync time: 0.1895      fsync time: 0.3545      fsync time: 0.0774
fsync time: 0.2577      fsync time: 0.2406      fsync time: 0.3027
fsync time: 0.4935      fsync time: 0.7193      fsync time: 0.2984
fsync time: 0.2804      fsync time: 0.3251      fsync time: 0.1057
fsync time: 0.2685      fsync time: 0.1001      fsync time: 0.3145
fsync time: 0.1946      fsync time: 0.2525      fsync time: 0.2992

IOC--> With IO controller patches applied. CONFIG_TRACK_ASYNC_CONTEXT=n
IOC(map async) --> IO controller patches with CONFIG_TRACK_ASYNC_CONTEXT=y

If CONFIG_TRACK_ASYNC_CONTEXT=y, async requests are mapped to the group based
on cgroup info stored in page otherwise these are mapped to the cgroup
submitting task belongs to.

Notes: 
- It looks like that max fsync time is a bit higher with IO controller
  patches. Wil dig more into it later.

Test2: read small files with multiple sequential readers (10) runnning
======================================================================
Took Ingo's small file reader test and ran it while 10 sequential readers
were running.

Vanilla CFQ     IOC (flat)      IOC (10 readers in 10 groups)
0.12 seconds    0.11 seconds    1.62 seconds
0.05 seconds    0.05 seconds    1.18 seconds
0.05 seconds    0.05 seconds    1.17 seconds
0.03 seconds    0.04 seconds    1.18 seconds
1.15 seconds    1.17 seconds    1.29 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.18 seconds    1.18 seconds
1.15 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
0.04 seconds    0.04 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.18 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.18 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.15 seconds    1.17 seconds
1.15 seconds    1.15 seconds    1.18 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds

In third column, 10 readers have been put into 10 groups instead of running
into root group. Small file reader runs in to root group.

Notes: It looks like that here read latencies remain same as with vanilla CFQ.

Test3: read small files with multiple writers (8) runnning
==========================================================
Again running small file reader test with 8 buffered writers running with
prio 0 to 7.

Latency results are in seconds. Tried to capture the output with multiple
configurations of IO controller to see the effect.

Vanilla  IOC     IOC     IOC     IOC    IOC     IOC
        (flat)(groups) (groups) (map)  (map)    (map)
                (f=0)   (f=1)   (flat) (groups) (groups)
                                        (f=0)   (f=1)
0.25    0.03    0.31    0.25    0.29    1.25    0.39
0.27    0.28    0.28    0.30    0.41    0.90    0.80
0.25    0.24    0.23    0.37    0.27    1.17    0.24
0.14    0.14    0.14    0.13    0.15    0.10    1.11
0.14    0.16    0.13    0.16    0.15    0.06    0.58
0.16    0.11    0.15    0.12    0.19    0.05    0.14
0.03    0.17    0.12    0.17    0.04    0.12    0.12
0.13    0.13    0.13    0.14    0.03    0.05    0.05
0.18    0.13    0.17    0.09    0.09    0.05    0.07
0.11    0.18    0.16    0.18    0.14    0.05    0.12
0.28    0.14    0.15    0.15    0.13    0.02    0.04
0.16    0.14    0.14    0.12    0.15    0.00    0.13
0.14    0.13    0.14    0.13    0.13    0.02    0.02
0.13    0.11    0.12    0.14    0.15    0.06    0.01
0.27    0.28    0.32    0.24    0.25    0.01    0.01
0.14    0.15    0.18    0.15    0.13    0.06    0.02
0.15    0.13    0.13    0.13    0.13    0.00    0.04
0.15    0.13    0.15    0.14    0.15    0.01    0.05
0.11    0.17    0.15    0.13    0.13    0.02    0.00
0.17    0.13    0.17    0.12    0.18    0.39    0.01
0.18    0.16    0.14    0.16    0.14    0.89    0.47
0.13    0.13    0.14    0.04    0.12    0.64    0.78
0.16    0.15    0.19    0.11    0.16    0.67    1.17
0.04    0.12    0.14    0.04    0.18    0.67    0.63
0.03    0.13    0.17    0.11    0.15    0.61    0.69
0.15    0.16    0.13    0.14    0.13    0.77    0.66
0.12    0.12    0.15    0.11    0.13    0.92    0.73
0.15    0.12    0.15    0.16    0.13    0.70    0.73
0.11    0.13    0.15    0.10    0.18    0.73    0.82
0.16    0.19    0.15    0.16    0.14    0.71    0.74
0.28    0.05    0.26    0.22    0.17    2.91    0.79
0.13    0.05    0.14    0.14    0.14    0.44    0.65
0.16    0.22    0.18    0.13    0.26    0.31    0.65
0.10    0.13    0.12    0.11    0.16    0.25    0.66
0.13    0.14    0.16    0.15    0.12    0.17    0.76
0.19    0.11    0.12    0.14    0.17    0.20    0.71
0.16    0.15    0.14    0.15    0.11    0.19    0.68
0.13    0.13    0.13    0.13    0.16    0.04    0.78
0.14    0.16    0.15    0.17    0.15    1.20    0.80
0.17    0.13    0.14    0.18    0.14    0.76    0.63

f(0/1)--> refers to the "fairness" tunable. This is a new tunable, part of CFQ.
	  If set, we wait for requests from one queue to finish before a new
	  queue is scheduled in.

group ---> writers are running in individual groups and not in the root group.
map ---> buffered writes are mapped to a group using the info stored in the page.

Notes: Except for columns 6 and 7, where writers are in separate groups and
we are mapping their writes to the respective groups, latencies seem to be
fine. I think the latencies are higher for the last two cases because now
the reader can't preempt the writers.

				root
			       / \  \ \
			      R  G1 G2 G3
				 |  |  |
				 W  W  W

Test4: Random Reader test in presence of 4 sequential readers and 4 buffered
       writers
============================================================================
Used fio this time to run one random reader and see how it fares in
the presence of 4 sequential readers and 4 writers.
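
The fio invocation was along these lines (a sketch only; job names and exact
parameters are assumptions, not the original job file):

# one random reader competing with 4 sequential readers and 4 buffered writers
fio --directory=/mnt/$BLOCKDEV --size=1G --runtime=60 --time_based \
	--name=randread --rw=randread --bs=4k --numjobs=1 \
	--name=seqread --rw=read --bs=64k --numjobs=4 \
	--name=bufwrite --rw=write --bs=64k --numjobs=4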

I have just pasted the output of random reader from fio.

Vanilla Kernel, Three runs
--------------------------
read : io=20,512KiB, bw=349KiB/s, iops=10, runt= 60075msec
clat (usec): min=944, max=2,675K, avg=93715.04, stdev=305815.90

read : io=13,696KiB, bw=233KiB/s, iops=7, runt= 60035msec
clat (msec): min=2, max=1,812, avg=140.26, stdev=382.55

read : io=13,824KiB, bw=235KiB/s, iops=7, runt= 60185msec
clat (usec): min=766, max=2,025K, avg=139310.55, stdev=383647.54

IO controller kernel, Three runs
--------------------------------
read : io=10,304KiB, bw=175KiB/s, iops=5, runt= 60083msec
clat (msec): min=2, max=2,654, avg=186.59, stdev=524.08

read : io=10,176KiB, bw=173KiB/s, iops=5, runt= 60054msec
clat (usec): min=792, max=2,567K, avg=188841.70, stdev=517154.75

read : io=11,040KiB, bw=188KiB/s, iops=5, runt= 60003msec
clat (usec): min=779, max=2,625K, avg=173915.56, stdev=508118.60

Notes:
- Looks like vanilla CFQ gives a bit more disk access to random reader. Will
  dig into it.

Throughput and Fairness
+++++++++++++++++++++++
Test5: Bandwidth distribution between 4 sequential readers and 4 buffered
       writers
==========================================================================
Used fio to launch 4 sequential readers and 4 buffered writers and watched
how BW is distributed.

Vanilla kernel, Three sets
--------------------------
read : io=962MiB, bw=16,818KiB/s, iops=513, runt= 60008msec
read : io=969MiB, bw=16,920KiB/s, iops=516, runt= 60077msec
read : io=978MiB, bw=17,063KiB/s, iops=520, runt= 60096msec
read : io=922MiB, bw=16,106KiB/s, iops=491, runt= 60057msec
write: io=235MiB, bw=4,099KiB/s, iops=125, runt= 60049msec
write: io=226MiB, bw=3,944KiB/s, iops=120, runt= 60049msec
write: io=215MiB, bw=3,747KiB/s, iops=114, runt= 60049msec
write: io=207MiB, bw=3,606KiB/s, iops=110, runt= 60049msec
READ: io=3,832MiB, aggrb=66,868KiB/s, minb=16,106KiB/s, maxb=17,063KiB/s,
mint=60008msec, maxt=60096msec
WRITE: io=882MiB, aggrb=15,398KiB/s, minb=3,606KiB/s, maxb=4,099KiB/s,
mint=60049msec, maxt=60049msec

read : io=1,002MiB, bw=17,513KiB/s, iops=534, runt= 60020msec
read : io=979MiB, bw=17,085KiB/s, iops=521, runt= 60080msec
read : io=953MiB, bw=16,637KiB/s, iops=507, runt= 60092msec
read : io=920MiB, bw=16,057KiB/s, iops=490, runt= 60108msec
write: io=215MiB, bw=3,560KiB/s, iops=108, runt= 63289msec
write: io=136MiB, bw=2,361KiB/s, iops=72, runt= 60502msec
write: io=127MiB, bw=2,101KiB/s, iops=64, runt= 63289msec
write: io=233MiB, bw=3,852KiB/s, iops=117, runt= 63289msec
READ: io=3,855MiB, aggrb=67,256KiB/s, minb=16,057KiB/s, maxb=17,513KiB/s,
mint=60020msec, maxt=60108msec
WRITE: io=711MiB, aggrb=11,771KiB/s, minb=2,101KiB/s, maxb=3,852KiB/s,
mint=60502msec, maxt=63289msec

read : io=985MiB, bw=17,179KiB/s, iops=524, runt= 60149msec
read : io=974MiB, bw=17,025KiB/s, iops=519, runt= 60002msec
read : io=962MiB, bw=16,772KiB/s, iops=511, runt= 60170msec
read : io=932MiB, bw=16,280KiB/s, iops=496, runt= 60057msec
write: io=177MiB, bw=2,933KiB/s, iops=89, runt= 63094msec
write: io=152MiB, bw=2,637KiB/s, iops=80, runt= 60323msec
write: io=240MiB, bw=3,983KiB/s, iops=121, runt= 63094msec
write: io=147MiB, bw=2,439KiB/s, iops=74, runt= 63094msec
READ: io=3,855MiB, aggrb=67,174KiB/s, minb=16,280KiB/s, maxb=17,179KiB/s,
mint=60002msec, maxt=60170msec
WRITE: io=715MiB, aggrb=11,877KiB/s, minb=2,439KiB/s, maxb=3,983KiB/s,
mint=60323msec, maxt=63094msec

IO controller kernel three sets
-------------------------------
read : io=944MiB, bw=16,483KiB/s, iops=503, runt= 60055msec
read : io=941MiB, bw=16,433KiB/s, iops=501, runt= 60073msec
read : io=900MiB, bw=15,713KiB/s, iops=479, runt= 60040msec
read : io=866MiB, bw=15,112KiB/s, iops=461, runt= 60086msec
write: io=244MiB, bw=4,262KiB/s, iops=130, runt= 60040msec
write: io=177MiB, bw=3,085KiB/s, iops=94, runt= 60042msec
write: io=158MiB, bw=2,758KiB/s, iops=84, runt= 60041msec
write: io=180MiB, bw=3,137KiB/s, iops=95, runt= 60040msec
READ: io=3,651MiB, aggrb=63,718KiB/s, minb=15,112KiB/s, maxb=16,483KiB/s,
mint=60040msec, maxt=60086msec
WRITE: io=758MiB, aggrb=13,243KiB/s, minb=2,758KiB/s, maxb=4,262KiB/s,
mint=60040msec, maxt=60042msec

read : io=960MiB, bw=16,734KiB/s, iops=510, runt= 60137msec
read : io=917MiB, bw=16,001KiB/s, iops=488, runt= 60122msec
read : io=897MiB, bw=15,683KiB/s, iops=478, runt= 60004msec
read : io=908MiB, bw=15,824KiB/s, iops=482, runt= 60149msec
write: io=209MiB, bw=3,563KiB/s, iops=108, runt= 61400msec
write: io=177MiB, bw=3,030KiB/s, iops=92, runt= 61400msec
write: io=200MiB, bw=3,409KiB/s, iops=104, runt= 61400msec
write: io=204MiB, bw=3,489KiB/s, iops=106, runt= 61400msec
READ: io=3,682MiB, aggrb=64,194KiB/s, minb=15,683KiB/s, maxb=16,734KiB/s,
mint=60004msec, maxt=60149msec
WRITE: io=790MiB, aggrb=13,492KiB/s, minb=3,030KiB/s, maxb=3,563KiB/s,
mint=61400msec, maxt=61400msec

read : io=968MiB, bw=16,867KiB/s, iops=514, runt= 60158msec
read : io=925MiB, bw=16,135KiB/s, iops=492, runt= 60142msec
read : io=875MiB, bw=15,286KiB/s, iops=466, runt= 60003msec
read : io=872MiB, bw=15,221KiB/s, iops=464, runt= 60049msec
write: io=213MiB, bw=3,720KiB/s, iops=113, runt= 60162msec
write: io=203MiB, bw=3,536KiB/s, iops=107, runt= 60163msec
write: io=208MiB, bw=3,620KiB/s, iops=110, runt= 60162msec
write: io=203MiB, bw=3,538KiB/s, iops=107, runt= 60163msec
READ: io=3,640MiB, aggrb=63,439KiB/s, minb=15,221KiB/s, maxb=16,867KiB/s,
mint=60003msec, maxt=60158msec
WRITE: io=827MiB, aggrb=14,415KiB/s, minb=3,536KiB/s, maxb=3,720KiB/s,
mint=60162msec, maxt=60163msec

Notes: It looks like vanilla CFQ favors readers over writers a bit more than
       the IO controller CFQ does. Will dig into it.

Test6: Bandwidth distribution between readers of diff prio
==========================================================
Using fio, ran 8 readers of prio 0 to 7, let them run for 30 seconds, and
watched the overall throughput and who got how much IO done.
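
A sketch of the fio invocation (assuming fio's per-job prio/prioclass options;
the original job file may differ):

# 8 direct sequential readers, one per best-effort priority 0..7
jobs=""
for p in 0 1 2 3 4 5 6 7; do
	jobs="$jobs --name=reader$p --prioclass=2 --prio=$p"
done
fio --directory=/mnt/$BLOCKDEV --rw=read --bs=64k --size=1G --direct=1 \
	--runtime=30 --time_based $jobs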

Vanilla kernel, Three sets
---------------------------
read : io=454MiB, bw=15,865KiB/s, iops=484, runt= 30004msec
read : io=382MiB, bw=13,330KiB/s, iops=406, runt= 30086msec
read : io=325MiB, bw=11,330KiB/s, iops=345, runt= 30074msec
read : io=294MiB, bw=10,253KiB/s, iops=312, runt= 30062msec
read : io=238MiB, bw=8,321KiB/s, iops=253, runt= 30048msec
read : io=145MiB, bw=5,061KiB/s, iops=154, runt= 30032msec
read : io=99MiB, bw=3,456KiB/s, iops=105, runt= 30021msec
read : io=67,040KiB, bw=2,280KiB/s, iops=69, runt= 30108msec
READ: io=2,003MiB, aggrb=69,767KiB/s, minb=2,280KiB/s, maxb=15,865KiB/s,
mint=30004msec, maxt=30108msec

read : io=450MiB, bw=15,727KiB/s, iops=479, runt= 30001msec
read : io=371MiB, bw=12,966KiB/s, iops=395, runt= 30040msec
read : io=325MiB, bw=11,321KiB/s, iops=345, runt= 30099msec
read : io=296MiB, bw=10,332KiB/s, iops=315, runt= 30086msec
read : io=238MiB, bw=8,319KiB/s, iops=253, runt= 30056msec
read : io=152MiB, bw=5,290KiB/s, iops=161, runt= 30070msec
read : io=100MiB, bw=3,483KiB/s, iops=106, runt= 30020msec
read : io=68,832KiB, bw=2,340KiB/s, iops=71, runt= 30118msec
READ: io=2,000MiB, aggrb=69,631KiB/s, minb=2,340KiB/s, maxb=15,727KiB/s,
mint=30001msec, maxt=30118msec

read : io=450MiB, bw=15,691KiB/s, iops=478, runt= 30068msec
read : io=369MiB, bw=12,882KiB/s, iops=393, runt= 30032msec
read : io=364MiB, bw=12,732KiB/s, iops=388, runt= 30015msec
read : io=283MiB, bw=9,889KiB/s, iops=301, runt= 30002msec
read : io=228MiB, bw=7,935KiB/s, iops=242, runt= 30091msec
read : io=144MiB, bw=5,018KiB/s, iops=153, runt= 30103msec
read : io=97,760KiB, bw=3,327KiB/s, iops=101, runt= 30083msec
read : io=66,784KiB, bw=2,276KiB/s, iops=69, runt= 30046msec
READ: io=1,999MiB, aggrb=69,625KiB/s, minb=2,276KiB/s, maxb=15,691KiB/s,
mint=30002msec, maxt=30103msec

IO controller kernel, Three sets
--------------------------------
read : io=404MiB, bw=14,103KiB/s, iops=430, runt= 30072msec
read : io=344MiB, bw=11,999KiB/s, iops=366, runt= 30035msec
read : io=294MiB, bw=10,257KiB/s, iops=313, runt= 30052msec
read : io=254MiB, bw=8,888KiB/s, iops=271, runt= 30021msec
read : io=238MiB, bw=8,311KiB/s, iops=253, runt= 30086msec
read : io=177MiB, bw=6,202KiB/s, iops=189, runt= 30001msec
read : io=158MiB, bw=5,517KiB/s, iops=168, runt= 30118msec
read : io=99MiB, bw=3,464KiB/s, iops=105, runt= 30107msec
READ: io=1,971MiB, aggrb=68,604KiB/s, minb=3,464KiB/s, maxb=14,103KiB/s,
mint=30001msec, maxt=30118msec

read : io=375MiB, bw=13,066KiB/s, iops=398, runt= 30110msec
read : io=326MiB, bw=11,409KiB/s, iops=348, runt= 30003msec
read : io=308MiB, bw=10,758KiB/s, iops=328, runt= 30066msec
read : io=256MiB, bw=8,937KiB/s, iops=272, runt= 30091msec
read : io=232MiB, bw=8,088KiB/s, iops=246, runt= 30041msec
read : io=192MiB, bw=6,695KiB/s, iops=204, runt= 30077msec
read : io=144MiB, bw=5,014KiB/s, iops=153, runt= 30051msec
read : io=96,224KiB, bw=3,281KiB/s, iops=100, runt= 30026msec
READ: io=1,928MiB, aggrb=67,145KiB/s, minb=3,281KiB/s, maxb=13,066KiB/s,
mint=30003msec, maxt=30110msec

read : io=405MiB, bw=14,162KiB/s, iops=432, runt= 30021msec
read : io=354MiB, bw=12,386KiB/s, iops=378, runt= 30007msec
read : io=303MiB, bw=10,567KiB/s, iops=322, runt= 30062msec
read : io=261MiB, bw=9,126KiB/s, iops=278, runt= 30040msec
read : io=228MiB, bw=7,946KiB/s, iops=242, runt= 30048msec
read : io=178MiB, bw=6,222KiB/s, iops=189, runt= 30074msec
read : io=152MiB, bw=5,286KiB/s, iops=161, runt= 30093msec
read : io=99MiB, bw=3,446KiB/s, iops=105, runt= 30110msec
READ: io=1,981MiB, aggrb=68,996KiB/s, minb=3,446KiB/s, maxb=14,162KiB/s,
mint=30007msec, maxt=30110msec

Notes:
- It looks like overall throughput is 1-3% less in the case of the IO controller.
- Bandwidth distribution between the various prio levels has changed a bit. CFQ
  seems to use a 100ms slice length for prio 4; the slice grows by 20% for each
  step towards a higher priority and shrinks by 20% for each step towards a
  lower priority. So the IO controller does not seem to be doing too badly at
  meeting that distribution.

Group Fairness
+++++++++++++++
Test7 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two virtual
machines in two different cgroups of weight 1000 and 500 respectively. The
virtual machines created ext3 file systems on the partitions exported from the
host and did buffered writes. The host sees the writes as synchronous, and the
virtual machine with the higher weight gets double the disk time of the virtual
machine with the lower weight. Used the deadline scheduler in this test case.

Some more details about the configuration are in the documentation patch.
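
A rough sketch of the cgroup setup on the host (group names and the qemu pid
variables are placeholders):

mount -t cgroup -o io,blkio none /cgroup
mkdir /cgroup/vm1 /cgroup/vm2
echo 1000 > /cgroup/vm1/io.weight
echo 500 > /cgroup/vm2/io.weight
# move the qemu-kvm processes of guest1 and guest2 into their groups
echo $QEMU_PID1 > /cgroup/vm1/tasks
echo $QEMU_PID2 > /cgroup/vm2/tasks
# deadline scheduler on the backing disk
echo deadline > /sys/block/sdb/queue/scheduler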

Test8 (Fairness for synchronous reads)
======================================
- Ran two "dd" readers in two cgroups with cgroup weights 1000 and 500 (with the
  CFQ scheduler and /sys/block/<device>/queue/iosched/fairness = 1).

  The higher weight dd finishes first, and at that point my script reads the
  cgroup files io.disk_time and io.disk_sectors for both groups and displays
  the results.

  dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
  dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

  group1 time=8:16 2452 group1 sectors=8:16 457856
  group2 time=8:16 1317 group2 sectors=8:16 247008

  234179072 bytes (234 MB) copied, 3.90912 s, 59.9 MB/s
  234179072 bytes (234 MB) copied, 5.15548 s, 45.4 MB/s

The first two fields in the time and sectors statistics represent the major and
minor number of the device. The third field represents the disk time in
milliseconds and the number of sectors transferred, respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read using the io.disk_time and
io.disk_sectors files in the cgroup. More about it in the documentation file.
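
A minimal sketch of the stats-reading part of the script (it simply dumps the
two files for each group; the cgroup paths are assumptions):

for grp in test1 test2; do
	echo "$grp:"
	cat /cgroup/$grp/io.disk_time
	cat /cgroup/$grp/io.disk_sectors
done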

Test9 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. IO controller can provide isolation between readers and
buffered (async) writers.

First I ran the test without the IO controller to see the severity of the
issue. Started a hostile writer, and after 10 seconds started a reader and
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
	conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test whether the IO controller can provide isolation between
readers and writers with noop. I created two cgroups of weight 1000 each, put
the reader in group1 and the writer in group2, and ran the test again. Upon
completion of the reader, my script reads the io.disk_time and io.disk_sectors
cgroup files to get an estimate of how much disk time each group got and how
many sectors each group did IO for.

For more accurate accounting of disk time for buffered writes with queuing
hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "1".

sample script
-------------
# writer runs in cgroup test2
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
# switch to noop and set its "fairness" tunable while the writer is running
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
# reader runs in cgroup test1; wait for it to finish
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------

Results
=======
268435456 bytes (268 MB) copied, 6.92248 s, 38.8 MB/s

group1 time=8:16 3185 group1 sectors=8:16 524824
group2 time=8:16 3190 group2 sectors=8:16 503848

Note that the reader now finishes in much less time, and both group1 and group2
got almost 3 seconds of disk time. Hence the IO controller provides isolation
from buffered writes.

Test10 (AIO)
===========

AIO reads
-----------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log \
	--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got until the first fio job finished.

Results
------

test1 statistics: time=8:16 17955   sectors=8:16 1049656 dq=8:16 2
test2 statistics: time=8:16 9217   sectors=8:16 602592 dq=8:16 1

The above shows that by the time the first fio job (higher weight) finished,
group test1 got 17955 ms of disk time and group test2 got 9217 ms of disk time.
Similarly, the statistics for the number of sectors transferred are also shown.

Note that disk time given to group test1 is almost double of group2 disk
time.

AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log \
	--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got until the first fio job finished.

Following are the results.

test1 statistics: time=8:16 25452   sectors=8:16 1049664 dq=8:16 2
test2 statistics: time=8:16 12939   sectors=8:16 532184 dq=8:16 4

The above shows that by the time the first fio job (higher weight) finished,
group test1 got almost double the disk time of group test2.

Test11 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache), and possibly also in the file
system layer (btrfs, xfs etc.), and are dispatched to the lower layers not
necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing out huge files. Very soon we will cross vm_dirty_ratio and a dd thread
will be forced to write out some pages to disk before more pages can be dirtied.
But the dirty pages picked are not necessarily those of the same thread; it can
very well pick the inode of the lower priority dd thread and do some writeout.
So effectively the higher weight dd is doing writeouts of the lower weight dd's
pages and we don't see service differentiation.
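
For concreteness, the two writers in this scenario look something like the
following (paths are placeholders; each dd sits in its own cgroup of different
weight):

echo $$ > /cgroup/bfqio/test1/tasks	# weight 1000
dd if=/dev/zero of=/mnt/$BLOCKDEV/bufwrite1 bs=4K count=1048576 &

echo $$ > /cgroup/bfqio/test2/tasks	# weight 500
dd if=/dev/zero of=/mnt/$BLOCKDEV/bufwrite2 bs=4K count=1048576 &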

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep the queue
continuously backlogged. In my testing, there are many 0.2 to 0.8 second
intervals where the higher weight queue is empty, and in that duration the lower
weight queue gets lots of work done, giving the impression that there was no
service differentiation.

In summary, from the IO controller point of view async write support is there.
But because the page cache has not been designed in such a manner that a higher
prio/weight writer can do more writeout than a lower prio/weight writer, getting
service differentiation is hard; it is visible in some cases and not visible
in others.

Previous versions of the patches were posted here.
------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204

Thanks
Vivek


* [PATCH 01/23] io-controller: Documentation
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal@redhat.com>
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 02/23] io-controller: Core of the elevator fair queuing Vivek Goyal
                     ` (29 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando,
	jmoyer, mingo, riel, fchecconi, containers, akpm, righi.andrea,
	torvalds

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  407 +++++++++++++++++++++++++++++++++
 2 files changed, 409 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..21948c3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,407 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and a task group
+will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control, hence this IO controller works only on block devices which use
+one of the standard IO schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying the IO scheduler is that resource
+control is primarily needed on leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three physical
+disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
+created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider following cgroup hierarchy
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume that T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between groups A and B if
+the IO is going to sda or sdc. But if the actual IO gets translated to disk sdb,
+then the IO scheduler associated with sdb will distribute disk bandwidth to
+groups A and B in proportion to their weights.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on the priority and class of the task. It is just that it is flat,
+and with cgroups it needs to be made hierarchical to achieve good
+hierarchical control of IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset takes its inspiration from the CFS cpu scheduler and CFQ to come
+up with the core of hierarchical scheduling. Like CFQ, we give time slices to
+every queue based on its priority. Like CFS, the disk time given to a
+queue is converted to virtual disk time (vdisktime) based on the queue's weight,
+and based on this vdisktime we decide which queue is the next to be
+dispatched.
+
+From a data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using the B-WF2Q+
+algorithm. io_queue is the end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, the noop, deadline and AS ioschedulers
+keep one io queue per cgroup and cfq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue it
+is mapped to within the group depends on the ioscheduler. Currently the "current"
+task is used to determine the cgroup (hence io group) of the request. Down the
+line we need to make use of the bio-cgroup patches to map delayed writes to the
+right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make use of
+that logic so that the end IO schedulers start supporting hierarchical scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new hierarchical
+scheduler as well as the old non-hierarchical scheduler operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in a hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queues but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables the blkio-cgroup controller for IO tracking
+	  purposes. That means, with this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to the
+	  submitting thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and the elevator fair queuing logic
+	  to use the IO tracking patches for async writes and attribute the
+	  writes to the original cgroup and not to the write-submitting thread.
+
+	  This should be primarily useful when lots of asynchronous writes
+	  are being submitted by pdflush threads and we need to assign the
+	  writes to right group.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in the blktrace output, helpful in
+	  debugging a hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+CONFIG_DEBUG_ELV_FAIR_QUEUING
+	- Enables some vdisktime related debugging messages.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+You can do a very simple testing of running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two same size files (say 512MB each) on same disk (file1, file2) and
+  launch two dd threads in different cgroup to read those files. Make sure
+  right io scheduler is being used for the block device where files are
+  present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At a macro level, the first dd should finish first. To get more precise data,
+  keep looking (with the help of a script) at the io.disk_time and io.disk_sectors
+  files of both the test1 and test2 groups. This will tell how much disk time
+  (in milliseconds) each group got and how many sectors each group
+  dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally the io.disk_time of the cgroups should be in proportion to their weights.
+
+Some High Level Test setups
+===========================
+One of the use cases of IO controller is to provide some kind of IO isolation
+between multiple virtual machines on the same host. Following is one
+example setup which worked for me.
+
+
+			     KVM	     KVM
+			    Guest1	    Guest2
+			   ---------      ----------
+			  |  -----  |    |  ------  |
+			  | | vdb | |    | | vdb  | |
+			  |  -----  |    |   ------ |
+			   ---------      ----------
+
+			   ---------------------------
+			  | Host		      |
+			  |         -------------     |
+			  |        | sdb1 | sdb2 |    |
+			  |         -------------     |
+			   ---------------------------
+
+On the host machine, I had a spare SATA disk. I created two partitions, sdb1
+and sdb2, and gave these partitions as additional storage to the kvm guests:
+sdb1 to KVM guest1 and sdb2 to KVM guest2. This storage appeared as /dev/vdb in
+both the guests. Formatted /dev/vdb, created an ext3 file system and
+started a 1G file writeout in both the guests. Before the writeout I had created
+two cgroups of weight 1000 and 500 and put the virtual machines in the two
+different groups.
+
+Following is the write I started in both the guests.
+
+dd if=/dev/zero of=/mnt/vdc/zerofile1 bs=4K count=262144 conv=fdatasync
+
+Following are the results on host with "deadline" scheduler.
+
+group1 time=8:16 17254 group1 sectors=8:16 2104288
+group2 time=8:16 8498  group2 sectors=8:16 1007040
+
+Virtual machine with cgroup weight 1000 got almost double the time of virtual
+machine with weight 500.
+
+What Works and What Does not
+============================
+Service differentiation at application level can be noticed only if completely
+parallel IO paths are created from application to IO scheduler and there
+are no serializations introduced by any intermediate layer. For example,
+in some cases file system and page cache layer introduce serialization and
+we don't see service difference between higher weight and lower weight
+process groups.
+
+For example, when I start an O_SYNC writeout on an ext3 file system (the file
+is being created newly), I see lots of activity from kjournald. I have not
+gone into the details yet, but my understanding is that there are a lot more
+journal commits and kjournald kind of introduces serialization between the two
+processes. So even if you put these two processes in two different cgroups
+with different weights, the higher weight process will not see more IO done.
+
+It does work very well when we bypass the filesystem layer and the IO is raw.
+For example, in the above virtual machine case, the host sees raw synchronous
+writes coming from the two guest machines and the filesystem layer on the host
+is not introducing any serialization, hence we can see the service difference.
+
+It also works very well for reads even on the same file system, as for reads
+the file system journalling activity does not kick in and we can create parallel
+IO paths from the application all the way down to the IO scheduler and get more
+IO done on the IO path with the higher weight.
+
+Regarding "fairness" parameter
+==============================
+The IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can take the values 0 and 1.
+
+If fairness is set to 1, then the IO controller waits for requests from the
+previous queue to finish before requests from a new queue are dispatched. This
+helps in doing better accounting of the disk time consumed by a queue. If this
+is not done, then on queuing hardware there can be requests from multiple queues
+in flight and we will not have any idea which queue consumed how much disk time.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is default io
+	  class of the group on all the devices until and unless overridden by
+	  per device rule. (See io.policy).
+
+	  1 = RT; 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is default weight of the group
+	  on all the devices until and unless overridden by per device rule.
+	  (See io.policy).
+
+	  Currently allowed range of weights is from 100 to 1000.
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was queued
+	  on service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea of whether we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	#echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where an io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If
+these request descriptors are exhausted, processes will be put to sleep and
+woken up once request descriptors are available again.
+
+With the io controller and cgroups, one can not afford to allocate requests
+from a single pool, as one group might allocate lots of requests and then tasks
+from other groups might be put to sleep even though such a group might be a
+higher weight group. Hence, to make sure that a group can always get the
+request descriptors it is entitled to, one needs to make the request descriptor
+limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control the total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This will
+make sure that apart from the root group one can create 3 more groups without
+running into any issues. If one decides to create more cgroups, nr_requests and
+nr_group_requests should be adjusted accordingly.
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 01/23] io-controller: Documentation
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  407 +++++++++++++++++++++++++++++++++
 2 files changed, 409 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for provding hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..21948c3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,407 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is one
+can create cgroups and assign prio/weights to those cgroups and task group
+will get access to disk proportionate to the weight of the group.
+
+These patches modify elevator layer and individual IO schedulers to do
+IO control hence this io controller works only on block devices which use
+one of the standard io schedulers can not be used with any xyz logical block
+device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is primarily needed on leaf nodes where the actual contention for resources is
+present and not on intertermediate logical block devices.
+
+Consider following hypothetical scenario. Lets say there are three physical
+disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
+created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider following cgroup hierarchy
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
+Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contetion for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+IO scheduler associated with the sdb will distribute disk bandwidth to
+group A and B proportionate to their weight.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. Just that it is flat and
+with cgroup stuff, it needs to be made hierarchical to achive a good
+hierarchical control on IO.
+
+Rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split in read and write queue
+for deadline and AS). With this patchset, now we maintain one queue per
+cgropu per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset takes the inspiration from CFS cpu scheduler and CFQ to come
+up with core of hierarchical scheduling. Like CFQ we give time slices to
+every queue based on their priority. Like CFS, this disktime given to a
+queue is converted to virtual disk time based on queue's weight (vdisktime)
+and based on this vdisktime we decide which is the queue next to be
+dispatched.
+
+From data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using B-WF2Q+
+algorithm. io_queue, is end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created by and managed by end io schedulers
+depending on its semantics. For example, noop, deadline and AS ioschedulers
+keep one io queues per cgroup and cfqq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by elevator layer and which io queue it
+is mapped to with in group depends on ioscheduler. Currently "current" task
+is used to determine the cgroup (hence io group) of the request. Down the
+line we need to make use of bio-cgroup patches to map delayed writes to
+right group.
+
+Going back to old behavior
+==========================
+In new scheme of things essentially we are creating hierarchical fair
+queuing logic in elevator layer and chaning IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+Elevator layer continues to support the old interfaces. So even if fair queuing
+is enabled at elevator layer, one can have both new hierchical scheduler as
+well as old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing as it is
+needed for provding fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierchical fair queuing in noop. Not selecting this option
+	  leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierchical fair queuing in deadline. Not selecting this
+	  option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierchical fair queuing in AS. Not selecting this option
+	  leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queus but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables blkio-cgroup controller for IO tracking
+	  purposes. That means, by this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to submitting
+	  thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and elevator fair queuing logic
+	  that for async writes make use of IO tracking patches and attribute
+	  writes to original cgroup and not to write submitting thread.
+
+	  This should be primarily useful when lots of asynchronous writes
+	  are being submitted by pdflush threads and we need to assign the
+	  writes to right group.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in blktrace output helpful in doing
+	  doing debugging in hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+CONFIG_DEBUG_ELV_FAIR_QUEUING
+	- Enables some vdisktime related debugging messages.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+You can do a very simple testing of running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in io scheuduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two same size files (say 512MB each) on same disk (file1, file2) and
+  launch two dd threads in different cgroup to read those files. Make sure
+  right io scheduler is being used for the block device where files are
+  present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At macro level, first dd should finish first. To get more precise data, keep
+  on looking at (with the help of script), at io.disk_time and io.disk_sectors
+  files of both test1 and test2 groups. This will tell how much disk time
+  (in milli seconds), each group got and how many secotors each group
+  dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally io.disk_time of cgroups should be in proportion to the weight.
+
+Some High Level Test setups
+===========================
+One of the use cases of IO controller is to provide some kind of IO isolation
+between multiple virtual machines on the same host. Following is one
+example setup which worked for me.
+
+
+			     KVM	     KVM
+			    Guest1	    Guest2
+			   ---------      ----------
+			  |  -----  |    |  ------  |
+			  | | vdb | |    | | vdb  | |
+			  |  -----  |    |   ------ |
+			   ---------      ----------
+
+			   ---------------------------
+			  | Host		      |
+			  |         -------------     |
+			  |        | sdb1 | sdb2 |    |
+			  |         -------------     |
+			   ---------------------------
+
+On host machine, I had a spare SATA disk. I created two partitions sdb1
+and sdb2 and gave this partitions as additional storage to kvm guests. sdb1
+to KVM guest1 and sdb2 KVM guest2. These storage appeared as /dev/vdb in
+both the guests. Formatted the /dev/vdb and created ext3 file system and
+started a 1G file writeout in both the guests. Before writeout I had created
+two cgroups of weight 1000 and 500 and put virtual machines in two different
+groups.
+
+Following is write I started in both the guests.
+
+dd if=/dev/zero of=/mnt/vdc/zerofile1 bs=4K count=262144 conv=fdatasync
+
+Following are the results on host with "deadline" scheduler.
+
+group1 time=8:16 17254 group1 sectors=8:16 2104288
+group2 time=8:16 8498  group2 sectors=8:16 1007040
+
+Virtual machine with cgroup weight 1000 got almost double the time of virtual
+machine with weight 500.
+
+What Works and What Does not
+============================
+Service differentiation at application level can be noticed only if completely
+parallel IO paths are created from application to IO scheduler and there
+are no serializations introduced by any intermediate layer. For example,
+in some cases file system and page cache layer introduce serialization and
+we don't see service difference between higher weight and lower weight
+process groups.
+
+For example, when I start an O_SYNC write out on an ext3 file system (file
+is being created newly), I see lots of activity from kjournald. I have not
+gone into details yet, but my understanding is that there are lot more
+journal commits and kjournald kind of introduces serialization between two
+processes. So even if you put these two processes in two different cgroups
+with different weights, higher weight process will not see more IO done.
+
+It does work very well when we bypass filesystem layer and IO is raw. For
+example in above virtual machine case, host sees raw synchronous writes
+coming from two guest machines and filesystem layer at host is not introducing
+any kind of serialization hence we can see the service difference.
+
+It also works very well for reads even on the same file system as for reads
+file system journalling activity does not kick in and we can create parallel
+IO paths from application to all the way down to IO scheduler and get more
+IO done on the IO path with higher weight.
+
+Regarding "fairness" parameter
+==============================
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1.
+
+If fairness is set to 1, then IO controller waits for requests to finish from
+previous queue before requests from new queue are dispatched. This helps in
+doing better accouting of disk time consumed by a queue. If this is not done
+then on a queuing hardware, there can be requests from multiple queues and
+we will not have any idea which queue consumed how much of disk time.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is default io
+	  class of the group on all the devices until and unless overridden by
+	  per device rule. (See io.policy).
+
+	  1 = RT; 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is default weight of the group
+	  on all the devices until and unless overridden by per device rule.
+	  (See io.policy).
+
+	  Currently allowed range of weights is from 100 to 1000.
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many a times a group was queued
+	  on service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_dequeue
+	- Debugging aid, only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea whether we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	#echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+======================================
+Traditionally there are 128 request descriptors allocated per request queue
+where an io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If
+these request descriptors are exhausted, processes will be put to sleep and
+woken up once request descriptors are available.
+
+With the io controller and cgroups, one cannot afford to allocate requests
+from a single pool, as one group might allocate lots of requests and then tasks
+from other groups might be put to sleep, and this other group might be a
+higher weight group. Hence, to make sure that a group can always get the
+request descriptors it is entitled to, one needs to make the request descriptor
+limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control the total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This will
+make sure that apart from the root group one can create three more groups
+without running into any issues. If one decides to create more cgroups,
+nr_requests and nr_group_requests should be adjusted accordingly.
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 01/23] io-controller: Documentation
@ 2009-08-28 21:30   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  407 +++++++++++++++++++++++++++++++++
 2 files changed, 409 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..21948c3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,407 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups, assign priorities/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this io controller works only on block devices which use
+one of the standard io schedulers; it cannot be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying the IO schedulers is that resource
+control is primarily needed on leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three physical
+disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
+created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider following cgroup hierarchy
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on the intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between groups A and B if
+the IO is going to sda or sdc. But if the actual IO gets translated to disk sdb,
+then the IO scheduler associated with sdb will distribute disk bandwidth to
+groups A and B in proportion to their weights.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on the priority and class of the task. It is just that it is flat,
+and with cgroups it needs to be made hierarchical to achieve good
+hierarchical control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset takes inspiration from the CFS cpu scheduler and from CFQ to come
+up with the core of hierarchical scheduling. Like CFQ, we give time slices to
+every queue based on its priority. Like CFS, the disk time given to a queue is
+converted to virtual disk time (vdisktime) based on the queue's weight, and
+based on this vdisktime we decide which queue is to be dispatched next. For
+example, if a queue of weight 1000 and a queue of weight 500 each consume 10ms
+of disk time, the vdisktime of the first queue advances only half as fast, so
+it is selected more often and ends up with roughly twice the disk time.
+
+From a data structure point of view, one can think of a tree per device, on
+which io groups and io queues hang and are scheduled using the B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored and
+dispatched from (like the cfqq in CFQ).
+
+These io queues are primarily created and managed by the end io schedulers,
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup and cfq keeps one io queue per
+io_context in a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer; which io queue it
+is mapped to within the group depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence io group) of the request.
+Down the line we need to make use of the bio-cgroup patches to map delayed
+writes to the right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make use
+of that logic so that the end IO schedulers start supporting hierarchical
+scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical scheduler as well as the old non-hierarchical scheduler operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in a hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one cannot disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queues but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables blkio-cgroup controller for IO tracking
+	  purposes. That means, by this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to submitting
+	  thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and elevator fair queuing logic
+	  that for async writes make use of IO tracking patches and attribute
+	  writes to original cgroup and not to write submitting thread.
+
+	  This should be primarily useful when lots of asynchronous writes
+	  are being submitted by pdflush threads and we need to assign the
+	  writes to right group.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in the blktrace output, helpful in
+	  debugging in a hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+CONFIG_DEBUG_ELV_FAIR_QUEUING
+	- Enables some vdisktime related debugging messages.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+You can do a very simple testing of running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two files of the same size (say 512MB each) on the same disk and
+  launch two dd threads in different cgroups to read those files. Make sure the
+  right io scheduler is being used for the block device where the files are
+  present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
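+
+  (The zerofile1 and zerofile2 files used above must already exist. They could
+  be created, for example, with the commands below; 512MB each, under the mount
+  point of the disk being tested.)
+
+	dd if=/dev/zero of=/mnt/sdb/zerofile1 bs=4K count=131072
+	dd if=/dev/zero of=/mnt/sdb/zerofile2 bs=4K count=131072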
+
+- At a macro level, the first dd should finish first. To get more precise data,
+  keep looking (with the help of a script) at the io.disk_time and
+  io.disk_sectors files of both the test1 and test2 groups. These will tell how
+  much disk time (in milliseconds) each group got and how many sectors each
+  group dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally io.disk_time of the cgroups should be in proportion to their weights.
+  A minimal monitoring loop is sketched below.
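+
+  The loop below is only an illustration, not part of the patchset. It assumes
+  the cgroup hierarchy is mounted at /cgroup and that the test1 and test2
+  groups exist as created above.
+
+	while true; do
+		for grp in test1 test2; do
+			echo "== $grp =="
+			cat /cgroup/$grp/io.disk_time
+			cat /cgroup/$grp/io.disk_sectors
+		done
+		sleep 2
+	done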
+
+Some High Level Test setups
+===========================
+One of the use cases of IO controller is to provide some kind of IO isolation
+between multiple virtual machines on the same host. Following is one
+example setup which worked for me.
+
+
+			     KVM	     KVM
+			    Guest1	    Guest2
+			   ---------      ----------
+			  |  -----  |    |  ------  |
+			  | | vdb | |    | | vdb  | |
+			  |  -----  |    |   ------ |
+			   ---------      ----------
+
+			   ---------------------------
+			  | Host		      |
+			  |         -------------     |
+			  |        | sdb1 | sdb2 |    |
+			  |         -------------     |
+			   ---------------------------
+
+On the host machine I had a spare SATA disk. I created two partitions, sdb1
+and sdb2, and gave these partitions as additional storage to the kvm guests:
+sdb1 to KVM guest1 and sdb2 to KVM guest2. This storage appeared as /dev/vdb in
+both guests. I formatted /dev/vdb, created an ext3 file system and
+started a 1G file writeout in both guests. Before the writeout I had created
+two cgroups with weights 1000 and 500 and put the virtual machines in the two
+different groups (one way of doing this is sketched below).
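+
+The sketch below shows one way of doing that. It assumes the io,blkio cgroup
+hierarchy is mounted at /cgroup as in the HOWTO section; <pid-of-guest1> and
+<pid-of-guest2> stand for the actual PIDs of the two qemu processes.
+
+	mkdir /cgroup/vm1 /cgroup/vm2
+	echo 1000 > /cgroup/vm1/io.weight
+	echo 500 > /cgroup/vm2/io.weight
+	echo <pid-of-guest1> > /cgroup/vm1/tasks
+	echo <pid-of-guest2> > /cgroup/vm2/tasks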
+
+Following is the write I started in both guests.
+
+dd if=/dev/zero of=/mnt/vdc/zerofile1 bs=4K count=262144 conv=fdatasync
+
+Following are the results on host with "deadline" scheduler.
+
+group1 time=8:16 17254 group1 sectors=8:16 2104288
+group2 time=8:16 8498  group2 sectors=8:16 1007040
+
+Virtual machine with cgroup weight 1000 got almost double the time of virtual
+machine with weight 500.
+
+What Works and What Does not
+============================
+Service differentiation at application level can be noticed only if completely
+parallel IO paths are created from application to IO scheduler and there
+are no serializations introduced by any intermediate layer. For example,
+in some cases file system and page cache layer introduce serialization and
+we don't see service difference between higher weight and lower weight
+process groups.
+
+For example, when I start an O_SYNC writeout on an ext3 file system (the file
+is newly created), I see lots of activity from kjournald. I have not
+gone into details yet, but my understanding is that there are a lot more
+journal commits and kjournald effectively introduces serialization between the
+two processes. So even if you put these two processes in two different cgroups
+with different weights, the higher weight process will not see more IO done.
+
+It does work very well when we bypass filesystem layer and IO is raw. For
+example in above virtual machine case, host sees raw synchronous writes
+coming from two guest machines and filesystem layer at host is not introducing
+any kind of serialization hence we can see the service difference.
+
+It also works very well for reads even on the same file system as for reads
+file system journalling activity does not kick in and we can create parallel
+IO paths from application to all the way down to IO scheduler and get more
+IO done on the IO path with higher weight.
+
+Regarding "fairness" parameter
+==============================
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1.
+
+If fairness is set to 1, then the IO controller waits for requests from the
+previous queue to finish before requests from the new queue are dispatched.
+This helps in doing better accounting of the disk time consumed by a queue. If
+this is not done, then on queuing hardware there can be requests from multiple
+queues in flight and we will not have any idea which queue consumed how much
+disk time.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is default io
+	  class of the group on all the devices until and unless overridden by
+	  per device rule. (See io.policy).
+
+	  1 = RT; 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is default weight of the group
+	  on all the devices until and unless overridden by per device rule.
+	  (See io.policy).
+
+	  Currently allowed range of weights is from 100 to 1000.
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid, only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives statistics about how many times a group was queued
+	  on the service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_dequeue
+	- Debugging aid, only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea whether we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	#echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+======================================
+Traditionally there are 128 request descriptors allocated per request queue
+where an io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If
+these request descriptors are exhausted, processes will be put to sleep and
+woken up once request descriptors are available.
+
+With the io controller and cgroups, one cannot afford to allocate requests
+from a single pool, as one group might allocate lots of requests and then tasks
+from other groups might be put to sleep, and this other group might be a
+higher weight group. Hence, to make sure that a group can always get the
+request descriptors it is entitled to, one needs to make the request descriptor
+limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control the total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This will
+make sure that apart from the root group one can create three more groups
+without running into any issues. If one decides to create more cgroups,
+nr_requests and nr_group_requests should be adjusted accordingly.
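+
+As an illustration of the above formula, suppose one wants four cgroups besides
+the root group on disk sdb and keeps nr_group_requests at its default of 128.
+Then nr_requests should be at least 5 * 128 = 640 (the device name is just an
+example):
+
+	echo 128 > /sys/block/sdb/queue/nr_group_requests
+	echo 640 > /sys/block/sdb/queue/nr_requests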
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 02/23] io-controller: Core of the elevator fair queuing
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-08-28 21:30   ` [PATCH 01/23] io-controller: Documentation Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 03/23] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
                     ` (28 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o This is the core of the io scheduler implemented at the elevator layer. It is a
  mix of the CFS cpu scheduler and the CFQ IO scheduler. Some of the bits from
  CFS have been borrowed and adapted so that we can support hierarchical
  scheduling. Without cgroups, or within a group, we should essentially get the
  same behavior as CFQ.

o This patch only introduces the non-hierarchical bits. Hierarchical code comes
  in later patches.

o This code is the base for introducing fair queuing logic in the common
  elevator layer so that it can be used by all four IO schedulers.

Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Makefile      |    2 +-
 block/elevator-fq.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  148 +++++++++++++++++++
 3 files changed, 553 insertions(+), 1 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Makefile b/block/Makefile
index 6c54ed0..19ff1e8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o
+			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..be7374d
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,404 @@
+/*
+ * elevator fair queuing Layer.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ * 	              Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+/*
+ * offset from end of service tree
+ */
+#define ELV_IDLE_DELAY		(HZ / 5)
+#define ELV_SLICE_SCALE		(500)
+#define ELV_SERVICE_SHIFT	20
+
+static inline struct io_queue *ioq_of(struct io_entity *entity)
+{
+	if (entity->my_sd == NULL)
+		return container_of(entity, struct io_queue, entity);
+	return NULL;
+}
+
+static inline int io_entity_class_rt(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int io_entity_class_idle(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline s64
+entity_key(struct io_service_tree *st, struct io_entity *entity)
+{
+	return entity->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+elv_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+	if (numerator_wt != denominator_wt) {
+		service = service * numerator_wt;
+		do_div(service, denominator_wt);
+	}
+
+	return service;
+}
+
+static inline u64 elv_delta_fair(unsigned long delta, struct io_entity *entity)
+{
+	u64 d = delta << ELV_SERVICE_SHIFT;
+
+	return elv_delta(d, IO_WEIGHT_DEFAULT, entity->weight);
+}
+
+static inline int
+elv_weight_slice(struct elv_fq_data *efqd, int sync, unsigned int weight)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(weight > IO_WEIGHT_MAX);
+
+	return elv_delta(base_slice, weight, IO_WEIGHT_DEFAULT);
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta > 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta < 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct io_service_tree *st)
+{
+	u64 vdisktime;
+
+	if (st->active_entity)
+		vdisktime = st->active_entity->vdisktime;
+
+	if (st->rb_leftmost) {
+		struct io_entity *entity = rb_entry(st->rb_leftmost,
+						struct io_entity, rb_node);
+
+		if (!st->active_entity)
+			vdisktime = entity->vdisktime;
+		else
+			vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
+	}
+
+	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+static inline struct io_group *iog_of(struct io_entity *entity)
+{
+	if (entity->my_sd)
+		return container_of(entity, struct io_group, entity);
+	return NULL;
+}
+
+static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
+{
+	return ioq_of(entity)->efqd;
+}
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	struct elv_fq_data *efqd = efqd_of(entity);
+
+	return &efqd->root_group->sched_data;
+}
+
+static inline void
+init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
+{
+	struct io_group *parent_iog = iog_of(parent);
+	unsigned short idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+	entity->st = &parent_iog->sched_data.service_tree[idx];
+}
+
+static void
+entity_served(struct io_entity *entity, unsigned long served,
+				unsigned long nr_sectors)
+{
+	entity->vdisktime += elv_delta_fair(served, entity);
+	update_min_vdisktime(entity->st);
+}
+
+static void place_entity(struct io_service_tree *st, struct io_entity *entity,
+				int add_front)
+{
+	u64 vdisktime = st->min_vdisktime;
+	struct rb_node *parent;
+	struct io_entity *entry;
+	int nr_active = st->nr_active - 1;
+
+	/*
+	 * Currently put entity at the end of last entity. This probably will
+	 * require adjustments as we move along
+	 */
+	if (io_entity_class_idle(entity)) {
+		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime += entry->vdisktime;
+		}
+	} else if (!add_front && nr_active) {
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime = entry->vdisktime;
+		}
+	} else
+		vdisktime = st->min_vdisktime;
+
+	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void io_entity_update_prio(struct io_entity *entity)
+{
+	if (unlikely(entity->ioprio_changed)) {
+		/*
+		 * Re-initialize the service tree as ioprio class of the
+		 * entity might have changed.
+		 */
+		init_io_entity_service_tree(entity, parent_entity(entity));
+		entity->ioprio_changed = 0;
+	}
+}
+
+static void
+__dequeue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	/*
+	 * This can happen when during put_prev_io_entity, we detect that ioprio
+	 * of the queue has changed and decide to dequeue_entity() and requeue
+	 * back. In this case entity is on service tree but has already been
+	 * removed from rb tree.
+	 */
+	if (RB_EMPTY_NODE(&entity->rb_node))
+		return;
+
+	if (st->rb_leftmost == &entity->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&entity->rb_node);
+		st->rb_leftmost = next_node;
+	}
+
+	rb_erase(&entity->rb_node, &st->active);
+	RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	__dequeue_io_entity(st, entity);
+	entity->on_st = 0;
+	st->nr_active--;
+	sd->nr_active--;
+}
+
+static void
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	struct rb_node **node = &st->active.rb_node;
+	struct rb_node *parent = NULL;
+	struct io_entity *entry;
+	s64 key = entity_key(st, entity);
+	int leftmost = 1;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (key < entity_key(st, entry)) {
+			node = &parent->rb_left;
+		} else {
+			node = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used)
+	 */
+	if (leftmost)
+		st->rb_leftmost = &entity->rb_node;
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, &st->active);
+}
+
+static void enqueue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	io_entity_update_prio(entity);
+	st = entity->st;
+	st->nr_active++;
+	sd->nr_active++;
+	entity->on_st = 1;
+	place_entity(st, entity, 0);
+	__enqueue_io_entity(st, entity);
+}
+
+static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
+{
+	struct rb_node *left = st->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct io_entity, rb_node);
+}
+
+static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity = NULL;
+	int i;
+
+	BUG_ON(sd->active_entity != NULL);
+
+	if (!sd->nr_active)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __lookup_next_io_entity(st);
+		if (entity) {
+			__dequeue_io_entity(st, entity);
+			st->active_entity = entity;
+			sd->active_entity = entity;
+			break;
+		}
+	}
+
+	return entity;
+}
+
+static void requeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_entity *next_entity;
+
+	next_entity = __lookup_next_io_entity(st);
+
+	/*
+	 * This is to emulate cfq like functionality where preemption can
+	 * happen with-in same class, like sync queue preempting async queue
+	 * May be this is not a very good idea from fairness point of view
+	 * as preempting queue gains share. Keeping it for now.
+	 *
+	 * This feature is also used by cfq close cooperator functionality
+	 * where cfq selects a queue out of order to run next based on
+	 * close cooperator.
+	 */
+
+	if (next_entity && next_entity != entity) {
+		__dequeue_io_entity(st, entity);
+		place_entity(st, entity, 1);
+		__enqueue_io_entity(st, entity);
+	}
+}
+
+/* Requeue an ioq (already on the tree) to the front of the service tree */
+static void requeue_ioq(struct io_queue *ioq)
+{
+	requeue_io_entity(&ioq->entity);
+}
+
+static void put_prev_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	st->active_entity = NULL;
+	sd->active_entity = NULL;
+
+	if (unlikely(entity->ioprio_changed)) {
+		dequeue_io_entity(entity);
+		enqueue_io_entity(entity);
+	} else
+		__enqueue_io_entity(st, entity);
+}
+
+/* Put curr ioq back into rb tree. */
+static void put_prev_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	put_prev_io_entity(entity);
+}
+
+static void dequeue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	dequeue_io_entity(entity);
+	elv_put_ioq(ioq);
+	return;
+}
+
+/* Put a new queue on to the tree */
+static void enqueue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	elv_get_ioq(ioq);
+	enqueue_io_entity(entity);
+}
+
+static inline void
+init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
+{
+	entity->parent = parent;
+	init_io_entity_service_tree(entity, parent);
+}
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..868e035
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,148 @@
+/*
+ * elevator fair queuing Layer. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ * 	              Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
+ */
+
+#ifdef CONFIG_BLOCK
+#include <linux/blkdev.h>
+
+#ifndef _ELV_SCHED_H
+#define _ELV_SCHED_H
+
+#define IO_WEIGHT_MIN		100
+#define IO_WEIGHT_MAX		1000
+#define IO_WEIGHT_DEFAULT	500
+#define IO_IOPRIO_CLASSES	3
+
+struct io_service_tree {
+	struct rb_root active;
+	struct io_entity *active_entity;
+	u64 min_vdisktime;
+	struct rb_node *rb_leftmost;
+	unsigned int nr_active;
+};
+
+struct io_sched_data {
+	struct io_entity *active_entity;
+	int nr_active;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+struct io_entity {
+	struct rb_node rb_node;
+	int on_st;
+	u64 vdisktime;
+	unsigned int weight;
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sd;
+	struct io_service_tree *st;
+
+	unsigned short ioprio, ioprio_class;
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_entity entity;
+	struct io_sched_data sched_data;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
+};
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(sync)
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline unsigned int elv_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	/* Map prio 7 - 0 to weights 200 to 900 */
+	return IO_WEIGHT_DEFAULT + (IO_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.ioprio = ioprio;
+	ioq->entity.weight = elv_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio;
+}
+
+extern void elv_put_ioq(struct io_queue *ioq);
+#endif /* _ELV_SCHED_H */
+#endif /* CONFIG_BLOCK */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 02/23] io-controller: Core of the elevator fair queuing
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o This is the core of the io scheduler implemented at the elevator layer. It is a
  mix of the CFS cpu scheduler and the CFQ IO scheduler. Some of the bits from
  CFS have been borrowed and adapted so that we can support hierarchical
  scheduling. Without cgroups, or within a group, we should essentially get the
  same behavior as CFQ.

o This patch only introduces the non-hierarchical bits. Hierarchical code comes
  in later patches.

o This code is the base for introducing fair queuing logic in the common
  elevator layer so that it can be used by all four IO schedulers.

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Makefile      |    2 +-
 block/elevator-fq.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  148 +++++++++++++++++++
 3 files changed, 553 insertions(+), 1 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Makefile b/block/Makefile
index 6c54ed0..19ff1e8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o
+			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..be7374d
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,404 @@
+/*
+ * elevator fair queuing Layer.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+/*
+ * offset from end of service tree
+ */
+#define ELV_IDLE_DELAY		(HZ / 5)
+#define ELV_SLICE_SCALE		(500)
+#define ELV_SERVICE_SHIFT	20
+
+static inline struct io_queue *ioq_of(struct io_entity *entity)
+{
+	if (entity->my_sd == NULL)
+		return container_of(entity, struct io_queue, entity);
+	return NULL;
+}
+
+static inline int io_entity_class_rt(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int io_entity_class_idle(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline s64
+entity_key(struct io_service_tree *st, struct io_entity *entity)
+{
+	return entity->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+elv_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+	if (numerator_wt != denominator_wt) {
+		service = service * numerator_wt;
+		do_div(service, denominator_wt);
+	}
+
+	return service;
+}
+
+static inline u64 elv_delta_fair(unsigned long delta, struct io_entity *entity)
+{
+	u64 d = delta << ELV_SERVICE_SHIFT;
+
+	return elv_delta(d, IO_WEIGHT_DEFAULT, entity->weight);
+}
+
+static inline int
+elv_weight_slice(struct elv_fq_data *efqd, int sync, unsigned int weight)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(weight > IO_WEIGHT_MAX);
+
+	return elv_delta(base_slice, weight, IO_WEIGHT_DEFAULT);
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta > 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta < 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct io_service_tree *st)
+{
+	u64 vdisktime;
+
+	if (st->active_entity)
+		vdisktime = st->active_entity->vdisktime;
+
+	if (st->rb_leftmost) {
+		struct io_entity *entity = rb_entry(st->rb_leftmost,
+						struct io_entity, rb_node);
+
+		if (!st->active_entity)
+			vdisktime = entity->vdisktime;
+		else
+			vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
+	}
+
+	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+static inline struct io_group *iog_of(struct io_entity *entity)
+{
+	if (entity->my_sd)
+		return container_of(entity, struct io_group, entity);
+	return NULL;
+}
+
+static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
+{
+	return ioq_of(entity)->efqd;
+}
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	struct elv_fq_data *efqd = efqd_of(entity);
+
+	return &efqd->root_group->sched_data;
+}
+
+static inline void
+init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
+{
+	struct io_group *parent_iog = iog_of(parent);
+	unsigned short idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+	entity->st = &parent_iog->sched_data.service_tree[idx];
+}
+
+static void
+entity_served(struct io_entity *entity, unsigned long served,
+				unsigned long nr_sectors)
+{
+	entity->vdisktime += elv_delta_fair(served, entity);
+	update_min_vdisktime(entity->st);
+}
+
+static void place_entity(struct io_service_tree *st, struct io_entity *entity,
+				int add_front)
+{
+	u64 vdisktime = st->min_vdisktime;
+	struct rb_node *parent;
+	struct io_entity *entry;
+	int nr_active = st->nr_active - 1;
+
+	/*
+	 * Currently put entity at the end of last entity. This probably will
+	 * require adjustments as we move along
+	 */
+	if (io_entity_class_idle(entity)) {
+		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime += entry->vdisktime;
+		}
+	} else if (!add_front && nr_active) {
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime = entry->vdisktime;
+		}
+	} else
+		vdisktime = st->min_vdisktime;
+
+	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void io_entity_update_prio(struct io_entity *entity)
+{
+	if (unlikely(entity->ioprio_changed)) {
+		/*
+		 * Re-initialize the service tree as ioprio class of the
+		 * entity might have changed.
+		 */
+		init_io_entity_service_tree(entity, parent_entity(entity));
+		entity->ioprio_changed = 0;
+	}
+}
+
+static void
+__dequeue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	/*
+	 * This can happen when during put_prev_io_entity, we detect that ioprio
+	 * of the queue has changed and decide to dequeue_entity() and requeue
+	 * back. In this case entity is on service tree but has already been
+	 * removed from rb tree.
+	 */
+	if (RB_EMPTY_NODE(&entity->rb_node))
+		return;
+
+	if (st->rb_leftmost == &entity->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&entity->rb_node);
+		st->rb_leftmost = next_node;
+	}
+
+	rb_erase(&entity->rb_node, &st->active);
+	RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	__dequeue_io_entity(st, entity);
+	entity->on_st = 0;
+	st->nr_active--;
+	sd->nr_active--;
+}
+
+static void
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	struct rb_node **node = &st->active.rb_node;
+	struct rb_node *parent = NULL;
+	struct io_entity *entry;
+	s64 key = entity_key(st, entity);
+	int leftmost = 1;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (key < entity_key(st, entry)) {
+			node = &parent->rb_left;
+		} else {
+			node = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used)
+	 */
+	if (leftmost)
+		st->rb_leftmost = &entity->rb_node;
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, &st->active);
+}
+
+static void enqueue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	io_entity_update_prio(entity);
+	st = entity->st;
+	st->nr_active++;
+	sd->nr_active++;
+	entity->on_st = 1;
+	place_entity(st, entity, 0);
+	__enqueue_io_entity(st, entity);
+}
+
+static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
+{
+	struct rb_node *left = st->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct io_entity, rb_node);
+}
+
+static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity = NULL;
+	int i;
+
+	BUG_ON(sd->active_entity != NULL);
+
+	if (!sd->nr_active)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __lookup_next_io_entity(st);
+		if (entity) {
+			__dequeue_io_entity(st, entity);
+			st->active_entity = entity;
+			sd->active_entity = entity;
+			break;
+		}
+	}
+
+	return entity;
+}
+
+static void requeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_entity *next_entity;
+
+	next_entity = __lookup_next_io_entity(st);
+
+	/*
+	 * This is to emulate cfq like functionality where preemption can
+	 * happen with-in same class, like sync queue preempting async queue
+	 * May be this is not a very good idea from fairness point of view
+	 * as preempting queue gains share. Keeping it for now.
+	 *
+	 * This feature is also used by cfq close cooperator functionality
+	 * where cfq selects a queue out of order to run next based on
+	 * close cooperator.
+	 */
+
+	if (next_entity && next_entity != entity) {
+		__dequeue_io_entity(st, entity);
+		place_entity(st, entity, 1);
+		__enqueue_io_entity(st, entity);
+	}
+}
+
+/* Requeue an ioq (already on the tree) to the front of the service tree */
+static void requeue_ioq(struct io_queue *ioq)
+{
+	requeue_io_entity(&ioq->entity);
+}
+
+static void put_prev_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	st->active_entity = NULL;
+	sd->active_entity = NULL;
+
+	if (unlikely(entity->ioprio_changed)) {
+		dequeue_io_entity(entity);
+		enqueue_io_entity(entity);
+	} else
+		__enqueue_io_entity(st, entity);
+}
+
+/* Put curr ioq back into rb tree. */
+static void put_prev_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	put_prev_io_entity(entity);
+}
+
+static void dequeue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	dequeue_io_entity(entity);
+	elv_put_ioq(ioq);
+	return;
+}
+
+/* Put a new queue on to the tree */
+static void enqueue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	elv_get_ioq(ioq);
+	enqueue_io_entity(entity);
+}
+
+static inline void
+init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
+{
+	entity->parent = parent;
+	init_io_entity_service_tree(entity, parent);
+}
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..868e035
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,148 @@
+/*
+ * elevator fair queuing Layer. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#ifdef CONFIG_BLOCK
+#include <linux/blkdev.h>
+
+#ifndef _ELV_SCHED_H
+#define _ELV_SCHED_H
+
+#define IO_WEIGHT_MIN		100
+#define IO_WEIGHT_MAX		1000
+#define IO_WEIGHT_DEFAULT	500
+#define IO_IOPRIO_CLASSES	3
+
+struct io_service_tree {
+	struct rb_root active;
+	struct io_entity *active_entity;
+	u64 min_vdisktime;
+	struct rb_node *rb_leftmost;
+	unsigned int nr_active;
+};
+
+struct io_sched_data {
+	struct io_entity *active_entity;
+	int nr_active;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+struct io_entity {
+	struct rb_node rb_node;
+	int on_st;
+	u64 vdisktime;
+	unsigned int weight;
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sd;
+	struct io_service_tree *st;
+
+	unsigned short ioprio, ioprio_class;
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_entity entity;
+	struct io_sched_data sched_data;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
+};
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(sync)
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline unsigned int elv_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	/* Map prio 7 - 0 to weights 200 to 900 */
+	return IO_WEIGHT_DEFAULT + (IO_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.ioprio = ioprio;
+	ioq->entity.weight = elv_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio;
+}
+
+extern void elv_put_ioq(struct io_queue *ioq);
+#endif /* _ELV_SCHED_H */
+#endif /* CONFIG_BLOCK */
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 02/23] io-controller: Core of the elevator fair queuing
@ 2009-08-28 21:30   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o This is the core of the io scheduler implemented at the elevator layer. It is a
  mix of the CFS cpu scheduler and the CFQ IO scheduler. Some of the bits from
  CFS have been borrowed and adapted so that we can support hierarchical
  scheduling. Without cgroups, or within a group, we should essentially get the
  same behavior as CFQ.

o This patch only introduces the non-hierarchical bits. Hierarchical code comes
  in later patches.

o This code is the base for introducing fair queuing logic in the common
  elevator layer so that it can be used by all four IO schedulers.

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Makefile      |    2 +-
 block/elevator-fq.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  148 +++++++++++++++++++
 3 files changed, 553 insertions(+), 1 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Makefile b/block/Makefile
index 6c54ed0..19ff1e8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o
+			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..be7374d
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,404 @@
+/*
+ * elevator fair queuing Layer.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+/*
+ * offset from end of service tree
+ */
+#define ELV_IDLE_DELAY		(HZ / 5)
+#define ELV_SLICE_SCALE		(500)
+#define ELV_SERVICE_SHIFT	20
+
+static inline struct io_queue *ioq_of(struct io_entity *entity)
+{
+	if (entity->my_sd == NULL)
+		return container_of(entity, struct io_queue, entity);
+	return NULL;
+}
+
+static inline int io_entity_class_rt(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int io_entity_class_idle(struct io_entity *entity)
+{
+	return entity->ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline s64
+entity_key(struct io_service_tree *st, struct io_entity *entity)
+{
+	return entity->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+elv_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+	if (numerator_wt != denominator_wt) {
+		service = service * numerator_wt;
+		do_div(service, denominator_wt);
+	}
+
+	return service;
+}
+
+static inline u64 elv_delta_fair(unsigned long delta, struct io_entity *entity)
+{
+	u64 d = delta << ELV_SERVICE_SHIFT;
+
+	return elv_delta(d, IO_WEIGHT_DEFAULT, entity->weight);
+}
+
+static inline int
+elv_weight_slice(struct elv_fq_data *efqd, int sync, unsigned int weight)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(weight > IO_WEIGHT_MAX);
+
+	return elv_delta(base_slice, weight, IO_WEIGHT_DEFAULT);
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta > 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+	s64 delta = (s64)(vdisktime - min_vdisktime);
+	if (delta < 0)
+		min_vdisktime = vdisktime;
+
+	return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct io_service_tree *st)
+{
+	u64 vdisktime;
+
+	if (st->active_entity)
+		vdisktime = st->active_entity->vdisktime;
+
+	if (st->rb_leftmost) {
+		struct io_entity *entity = rb_entry(st->rb_leftmost,
+						struct io_entity, rb_node);
+
+		if (!st->active_entity)
+			vdisktime = entity->vdisktime;
+		else
+			vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
+	}
+
+	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+static inline struct io_group *iog_of(struct io_entity *entity)
+{
+	if (entity->my_sd)
+		return container_of(entity, struct io_group, entity);
+	return NULL;
+}
+
+static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
+{
+	return ioq_of(entity)->efqd;
+}
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	struct elv_fq_data *efqd = efqd_of(entity);
+
+	return &efqd->root_group->sched_data;
+}
+
+static inline void
+init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
+{
+	struct io_group *parent_iog = iog_of(parent);
+	unsigned short idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+	entity->st = &parent_iog->sched_data.service_tree[idx];
+}
+
+static void
+entity_served(struct io_entity *entity, unsigned long served,
+				unsigned long nr_sectors)
+{
+	entity->vdisktime += elv_delta_fair(served, entity);
+	update_min_vdisktime(entity->st);
+}
+
+static void place_entity(struct io_service_tree *st, struct io_entity *entity,
+				int add_front)
+{
+	u64 vdisktime = st->min_vdisktime;
+	struct rb_node *parent;
+	struct io_entity *entry;
+	int nr_active = st->nr_active - 1;
+
+	/*
+	 * Currently put entity at the end of last entity. This probably will
+	 * require adjustments as we move along
+	 */
+	if (io_entity_class_idle(entity)) {
+		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime += entry->vdisktime;
+		}
+	} else if (!add_front && nr_active) {
+		parent = rb_last(&st->active);
+		if (parent) {
+			entry = rb_entry(parent, struct io_entity, rb_node);
+			vdisktime = entry->vdisktime;
+		}
+	} else
+		vdisktime = st->min_vdisktime;
+
+	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void io_entity_update_prio(struct io_entity *entity)
+{
+	if (unlikely(entity->ioprio_changed)) {
+		/*
+		 * Re-initialize the service tree as ioprio class of the
+		 * entity might have changed.
+		 */
+		init_io_entity_service_tree(entity, parent_entity(entity));
+		entity->ioprio_changed = 0;
+	}
+}
+
+static void
+__dequeue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	/*
+	 * This can happen when during put_prev_io_entity, we detect that ioprio
+	 * of the queue has changed and decide to dequeue_entity() and requeue
+	 * back. In this case entity is on service tree but has already been
+	 * removed from rb tree.
+	 */
+	if (RB_EMPTY_NODE(&entity->rb_node))
+		return;
+
+	if (st->rb_leftmost == &entity->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&entity->rb_node);
+		st->rb_leftmost = next_node;
+	}
+
+	rb_erase(&entity->rb_node, &st->active);
+	RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	__dequeue_io_entity(st, entity);
+	entity->on_st = 0;
+	st->nr_active--;
+	sd->nr_active--;
+}
+
+static void
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+	struct rb_node **node = &st->active.rb_node;
+	struct rb_node *parent = NULL;
+	struct io_entity *entry;
+	s64 key = entity_key(st, entity);
+	int leftmost = 1;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (key < entity_key(st, entry)) {
+			node = &parent->rb_left;
+		} else {
+			node = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used)
+	 */
+	if (leftmost)
+		st->rb_leftmost = &entity->rb_node;
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, &st->active);
+}
+
+static void enqueue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	io_entity_update_prio(entity);
+	st = entity->st;
+	st->nr_active++;
+	sd->nr_active++;
+	entity->on_st = 1;
+	place_entity(st, entity, 0);
+	__enqueue_io_entity(st, entity);
+}
+
+static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
+{
+	struct rb_node *left = st->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct io_entity, rb_node);
+}
+
+static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity = NULL;
+	int i;
+
+	BUG_ON(sd->active_entity != NULL);
+
+	if (!sd->nr_active)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __lookup_next_io_entity(st);
+		if (entity) {
+			__dequeue_io_entity(st, entity);
+			st->active_entity = entity;
+			sd->active_entity = entity;
+			break;
+		}
+	}
+
+	return entity;
+}
+
+static void requeue_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_entity *next_entity;
+
+	next_entity = __lookup_next_io_entity(st);
+
+	/*
+	 * This is to emulate cfq like functionality where preemption can
+	 * happen with-in same class, like sync queue preempting async queue
+	 * Maybe this is not a very good idea from a fairness point of view
+	 * as preempting queue gains share. Keeping it for now.
+	 *
+	 * This feature is also used by cfq close cooperator functionality
+	 * where cfq selects a queue out of order to run next based on
+	 * close cooperator.
+	 */
+
+	if (next_entity && next_entity != entity) {
+		__dequeue_io_entity(st, entity);
+		place_entity(st, entity, 1);
+		__enqueue_io_entity(st, entity);
+	}
+}
+
+/* Requeue and ioq (already on the tree) to the front of service tree */
+static void requeue_ioq(struct io_queue *ioq)
+{
+	requeue_io_entity(&ioq->entity);
+}
+
+static void put_prev_io_entity(struct io_entity *entity)
+{
+	struct io_service_tree *st = entity->st;
+	struct io_sched_data *sd = io_entity_sched_data(entity);
+
+	st->active_entity = NULL;
+	sd->active_entity = NULL;
+
+	if (unlikely(entity->ioprio_changed)) {
+		dequeue_io_entity(entity);
+		enqueue_io_entity(entity);
+	} else
+		__enqueue_io_entity(st, entity);
+}
+
+/* Put curr ioq back into rb tree. */
+static void put_prev_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	put_prev_io_entity(entity);
+}
+
+static void dequeue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	dequeue_io_entity(entity);
+	elv_put_ioq(ioq);
+	return;
+}
+
+/* Put a new queue on to the tree */
+static void enqueue_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	elv_get_ioq(ioq);
+	enqueue_io_entity(entity);
+}
+
+static inline void
+init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
+{
+	entity->parent = parent;
+	init_io_entity_service_tree(entity, parent);
+}
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..868e035
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,148 @@
+/*
+ * elevator fair queuing Layer. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#ifdef CONFIG_BLOCK
+#include <linux/blkdev.h>
+
+#ifndef _ELV_SCHED_H
+#define _ELV_SCHED_H
+
+#define IO_WEIGHT_MIN		100
+#define IO_WEIGHT_MAX		1000
+#define IO_WEIGHT_DEFAULT	500
+#define IO_IOPRIO_CLASSES	3
+
+struct io_service_tree {
+	struct rb_root active;
+	struct io_entity *active_entity;
+	u64 min_vdisktime;
+	struct rb_node *rb_leftmost;
+	unsigned int nr_active;
+};
+
+struct io_sched_data {
+	struct io_entity *active_entity;
+	int nr_active;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+struct io_entity {
+	struct rb_node rb_node;
+	int on_st;
+	u64 vdisktime;
+	unsigned int weight;
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sd;
+	struct io_service_tree *st;
+
+	unsigned short ioprio, ioprio_class;
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_entity entity;
+	struct io_sched_data sched_data;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
+};
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(sync)
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline unsigned int elv_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	/* Map prio 7 - 0 to weights 200 to 900 */
+	return IO_WEIGHT_DEFAULT + (IO_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.ioprio = ioprio;
+	ioq->entity.weight = elv_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio;
+}
+
+extern void elv_put_ioq(struct io_queue *ioq);
+#endif /* _ELV_SCHED_H */
+#endif /* CONFIG_BLOCK */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-08-28 21:30   ` [PATCH 01/23] io-controller: Documentation Vivek Goyal
  2009-08-28 21:30   ` [PATCH 02/23] io-controller: Core of the elevator fair queuing Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
                     ` (27 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support: there is a single group, the "root group", and all
tasks belong to it.

These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.

This is essentially a lot of CFQ logic moved into the common layer so that
other IO schedulers can make use of it in a hierarchical scheduling setup. A
rough sketch of how an io scheduler hooks into this layer follows below.
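
The sketch below is illustrative only and not taken from this series. It
shows roughly what an io scheduler now does to participate: its init
function receives the elevator queue, and each scheduler-private queue is
backed by an io_queue parented to the (flat) root group via the helpers
exported below. The names my_data, my_queue and my_init_queue are invented;
the real conversion of cfq follows in the next patch.

struct my_queue {
	struct list_head fifo;		/* scheduler-private request list */
};

struct my_data {
	struct io_queue *ioq;		/* common-layer queue backing my_queue */
	struct my_queue mq;
};

/* Note the second argument added to elevator_init_fn by this patch */
static void *my_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
	struct my_data *d;

	d = kmalloc_node(sizeof(*d), GFP_KERNEL | __GFP_ZERO, q->node);
	if (!d)
		return NULL;

	d->ioq = elv_alloc_ioq(q, GFP_KERNEL);
	if (!d->ioq) {
		kfree(d);
		return NULL;
	}

	INIT_LIST_HEAD(&d->mq.fifo);

	/* Defaults to BE class / normal prio and records the owning efqd */
	elv_init_ioq(eq, d->ioq, current->pid, 1);
	/* Flat mode: every queue is parented to the single root group */
	elv_init_ioq_io_group(d->ioq, ioq_to_io_group(d->ioq));
	/* Tie the common-layer io_queue back to the scheduler's own queue */
	elv_init_ioq_sched_queue(eq, d->ioq, &d->mq);
	/* Pin the io_queue while we hold a pointer to it */
	elv_get_ioq(d->ioq);

	return d;
}

The details of the real cfq conversion differ; this only illustrates the
shape of the new hooks.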

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    3 +-
 block/as-iosched.c       |    2 +-
 block/blk.h              |    6 +
 block/cfq-iosched.c      |    2 +-
 block/deadline-iosched.c |    3 +-
 block/elevator-fq.c      |  985 ++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h      |  229 +++++++++++
 block/elevator.c         |   63 +++-
 block/noop-iosched.c     |    2 +-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   50 +++-
 12 files changed, 1330 insertions(+), 42 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had a notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control
+	  IO, even the simple io schedulers like noop, deadline and as will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index 19ff1e8..d545323 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
+			ioctl.o genhd.o scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..b90acbe 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1351,7 +1351,7 @@ static void as_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (as_data).
  */
-static void *as_init_queue(struct request_queue *q)
+static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct as_data *ad;
 
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..d05b4cf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,6 +1,8 @@
 #ifndef BLK_INTERNAL_H
 #define BLK_INTERNAL_H
 
+#include "elevator-fq.h"
+
 /* Amount of time in which a process may batch requests */
 #define BLK_BATCH_TIME	(HZ/50UL)
 
@@ -71,6 +73,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_activate_rq_fair(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +83,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_deactivate_rq_fair(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..5a67ec0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2448,7 +2448,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	kfree(cfqd);
 }
 
-static void *cfq_init_queue(struct request_queue *q)
+static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct cfq_data *cfqd;
 	int i;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..25af8b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -347,7 +347,8 @@ static void deadline_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static void *deadline_init_queue(struct request_queue *q)
+static void *
+deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index be7374d..1ca7b4a 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,14 +12,23 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/blktrace_api.h>
 #include "elevator-fq.h"
 
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+static struct kmem_cache *elv_ioq_pool;
+
 /*
  * offset from end of service tree
  */
 #define ELV_IDLE_DELAY		(HZ / 5)
 #define ELV_SLICE_SCALE		(500)
 #define ELV_SERVICE_SHIFT	20
+#define ELV_HW_QUEUE_MIN	(5)
+#define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
+				{ RB_ROOT, NULL, 0, NULL, 0})
 
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
@@ -98,7 +107,7 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
 
 static void update_min_vdisktime(struct io_service_tree *st)
 {
-	u64 vdisktime;
+	u64 vdisktime = st->min_vdisktime;
 
 	if (st->active_entity)
 		vdisktime = st->active_entity->vdisktime;
@@ -133,6 +142,12 @@ static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 	return ioq_of(entity)->efqd;
 }
 
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return ioq->efqd->root_group;
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
 static inline struct io_sched_data *
 io_entity_sched_data(struct io_entity *entity)
 {
@@ -238,7 +253,8 @@ static void dequeue_io_entity(struct io_entity *entity)
 }
 
 static void
-__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity,
+			int add_front)
 {
 	struct rb_node **node = &st->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -250,7 +266,8 @@ __enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
 		parent = *node;
 		entry = rb_entry(parent, struct io_entity, rb_node);
 
-		if (key < entity_key(st, entry)) {
+		if (key < entity_key(st, entry) ||
+			(add_front && (key == entity_key(st, entry)))) {
 			node = &parent->rb_left;
 		} else {
 			node = &parent->rb_right;
@@ -280,7 +297,7 @@ static void enqueue_io_entity(struct io_entity *entity)
 	sd->nr_active++;
 	entity->on_st = 1;
 	place_entity(st, entity, 0);
-	__enqueue_io_entity(st, entity);
+	__enqueue_io_entity(st, entity, 0);
 }
 
 static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
@@ -310,6 +327,7 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 			__dequeue_io_entity(st, entity);
 			st->active_entity = entity;
 			sd->active_entity = entity;
+			update_min_vdisktime(entity->st);
 			break;
 		}
 	}
@@ -317,35 +335,37 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 	return entity;
 }
 
-static void requeue_io_entity(struct io_entity *entity)
+static void requeue_io_entity(struct io_entity *entity, int add_front)
 {
 	struct io_service_tree *st = entity->st;
 	struct io_entity *next_entity;
 
-	next_entity = __lookup_next_io_entity(st);
+	if (add_front) {
+		next_entity = __lookup_next_io_entity(st);
 
-	/*
-	 * This is to emulate cfq like functionality where preemption can
-	 * happen with-in same class, like sync queue preempting async queue
-	 * Maybe this is not a very good idea from a fairness point of view
-	 * as preempting queue gains share. Keeping it for now.
-	 *
-	 * This feature is also used by cfq close cooperator functionality
-	 * where cfq selects a queue out of order to run next based on
-	 * close cooperator.
-	 */
+		/*
+		 * This is to emulate cfq like functionality where preemption
+		 * can happen with-in same class, like sync queue preempting
+		 * async queue.
+		 *
+		 * This feature is also used by cfq close cooperator
+		 * functionality where cfq selects a queue out of order to run
+		 * next based on close cooperator.
+		 */
 
-	if (next_entity && next_entity != entity) {
-		__dequeue_io_entity(st, entity);
-		place_entity(st, entity, 1);
-		__enqueue_io_entity(st, entity);
+		if (next_entity && next_entity == entity)
+			return;
 	}
+
+	__dequeue_io_entity(st, entity);
+	place_entity(st, entity, add_front);
+	__enqueue_io_entity(st, entity, add_front);
 }
 
-/* Requeue and ioq (already on the tree) to the front of service tree */
-static void requeue_ioq(struct io_queue *ioq)
+/* Requeue and ioq which is already on the tree */
+static void requeue_ioq(struct io_queue *ioq, int add_front)
 {
-	requeue_io_entity(&ioq->entity);
+	requeue_io_entity(&ioq->entity, add_front);
 }
 
 static void put_prev_io_entity(struct io_entity *entity)
@@ -360,7 +380,7 @@ static void put_prev_io_entity(struct io_entity *entity)
 		dequeue_io_entity(entity);
 		enqueue_io_entity(entity);
 	} else
-		__enqueue_io_entity(st, entity);
+		__enqueue_io_entity(st, entity, 0);
 }
 
 /* Put curr ioq back into rb tree. */
@@ -398,7 +418,924 @@ init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
 
 void elv_put_ioq(struct io_queue *ioq)
 {
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = efqd->eq;
+
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
 }
+EXPORT_SYMBOL(elv_put_ioq);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served, ioq->nr_sectors);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(q, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd->idle_slice_timer);
+	cancel_work_sync(&e->efqd->unplug_work);
+}
+
+static void elv_set_prio_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	ioq->slice_start = jiffies;
+	ioq->slice_end = elv_prio_to_slice(efqd, ioq) + jiffies;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->slice_end - jiffies);
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
+		int is_sync)
+{
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = eq->efqd;
+	ioq->pid = pid;
+
+	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+						int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(elv_io_group_async_queue_prio);
+
+void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_io_group_set_async_queue);
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+/*
+ * Should be called after ioq prio and class has been initialized as prio
+ * class data will be used to determine which service tree in the group
+ * entity should be attached to.
+ */
+void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog)
+{
+	init_io_entity_parent(&ioq->entity, &iog->entity);
+}
+EXPORT_SYMBOL(elv_init_ioq_io_group);
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = lookup_next_io_entity(sd);
+	if (!entity)
+		return NULL;
+
+	ioq = ioq_of(entity);
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells that io scheduler selected a queue for us
+ * and we did not select the next queue based on fairness.
+ */
+static void
+__elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+						efqd->busy_queues);
+		ioq->slice_start = ioq->slice_end = 0;
+		ioq->dispatch_start = jiffies;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq && eq->ops->elevator_active_ioq_set_fn)
+		eq->ops->elevator_active_ioq_set_fn(q, ioq->sched_queue, coop);
+}
+
+/* Get and set a new active queue for service. */
+static struct
+io_queue *elv_set_active_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	int coop = 0;
+
+	if (ioq) {
+		requeue_ioq(ioq, 1);
+		/*
+		 * io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+
+	ioq = elv_get_next_ioq(q);
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_active_ioq_reset_fn)
+		eq->ops->elevator_active_ioq_reset_fn(q, ioq->sched_queue);
+
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	enqueue_ioq(ioq);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	dequeue_ioq(ioq);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the queue's start and finish times and
+ * the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the same
+ * time, it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from that queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and
+ * further) requests from the queue to finish. For seeky queues, we will
+ * expire the queue after dispatching a few requests without waiting and
+ * start dispatching from the next queue.
+ *
+ * Currently one should set fairness = 1 to force completion of requests
+ * from a queue before dispatch from the next queue starts. This should help
+ * with better time accounting at the expense of throughput.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	long slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * Queue got expired before even a single request completed or
+	 * got expired immediately after first request completion. Use
+	 * the time elapsed since queue was scheduled in.
+	 */
+	if (!ioq->slice_end || ioq->slice_start == jiffies) {
+		slice_used = jiffies - ioq->dispatch_start;
+		if (!slice_used)
+			slice_used = 1;
+		goto done;
+	}
+
+	slice_used = jiffies - ioq->slice_start;
+	if (time_after(jiffies, ioq->slice_end))
+		slice_overshoot = jiffies - ioq->slice_end;
+
+done:
+	elv_log_ioq(efqd, ioq, "disp_start = %lu sl_start= %lu sl_end=%lu,"
+			" jiffies=%lu", ioq->dispatch_start, ioq->slice_start,
+			ioq->slice_end, jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, overshoot=%ld sect=%lu",
+				slice_used, slice_overshoot, ioq->nr_sectors);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+	/* Queue is being expired. Reset number of sectors dispatched */
+	ioq->nr_sectors = 0;
+
+	put_prev_ioq(ioq);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq);
+	else if (!elv_ioq_sync(ioq)) {
+		/*
+		 * Requeue async ioq so that these will be again placed at
+		 * the end of service tree giving a chance to sync queues.
+		 */
+		requeue_ioq(ioq, 0);
+	}
+}
+EXPORT_SYMBOL(elv_ioq_slice_expired);
+
+/* Expire the ioq. */
+void elv_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no (or if we aren't sure); a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT queue's timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn) {
+		void *sched_queue = elv_ioq_sched_queue(new_ioq);
+
+		return eq->ops->elevator_should_preempt_fn(q, sched_queue, rq);
+	}
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+	elv_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	requeue_ioq(ioq, 1);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	ioq->nr_queued++;
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
+				__blk_run_queue(q);
+			else
+				elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired it's mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_arm_slice_timer_fn)
+		eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+	void *sched_queue = ioq->sched_queue;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q, sched_queue);
+
+	if (new_ioq)
+		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	ioq->dispatched++;
+	ioq->nr_sectors += blk_rq_sectors(rq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_activate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+			ioq->nr_queued, efqd->rq_in_driver,
+			elv_ioq_nr_dispatched(ioq));
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_set_prio_slice(q->elevator->efqd, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = NULL;
+
+	efqd = kmalloc_node(sizeof(*efqd), GFP_KERNEL | __GFP_ZERO, q->node);
+	return efqd;
+}
+
+void elv_release_fq_data(struct elv_fq_data *efqd)
+{
+	kfree(efqd);
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+
+	/*
+	 * Our fallback ioq if elv_alloc_ioq() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	elv_init_ioq(e, &efqd->oom_ioq, 1, 0);
+	elv_get_ioq(&efqd->oom_ioq);
+	elv_init_ioq_io_group(&efqd->oom_ioq, iog);
+
+	efqd->queue = q;
+	efqd->eq = e;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 868e035..6d3809f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 #define IO_WEIGHT_DEFAULT	500
 #define IO_IOPRIO_CLASSES	3
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 struct io_service_tree {
 	struct rb_root active;
 	struct io_entity *active_entity;
@@ -61,23 +65,80 @@ struct io_queue {
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Number of sectors dispatched in current dispatch round */
+	unsigned long nr_sectors;
+
+	/* time when dispatch from the queue was started */
+	unsigned long dispatch_start;
+	/* time when first request from queue completed and slice started. */
+	unsigned long slice_start;
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
 };
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	struct request_queue *queue;
+	struct elevator_queue *eq;
+	unsigned int busy_queues;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
 	/* Base slice length for sync and async queues */
 	unsigned int elv_slice[2];
+
+	/* Fallback dummy ioq for extreme OOM conditions */
+	struct io_queue oom_ioq;
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
 /* Some shared queue flag manipulation functions among elevators */
 
 enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
 };
 
@@ -95,6 +156,11 @@ static inline int elv_ioq_##name(struct io_queue *ioq)         		\
 	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
 }
 
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 
 static inline void elv_get_ioq(struct io_queue *ioq)
@@ -143,6 +209,169 @@ static inline int elv_ioq_ioprio(struct io_queue *ioq)
 	return ioq->entity.ioprio;
 }
 
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many request are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many request are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd->active_queue;
+}
+
+static inline void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return elv_ioq_sched_queue(elv_active_ioq(e));
+}
+
+static inline int elv_rq_in_driver(struct elevator_queue *e)
+{
+	return e->efqd->rq_in_driver;
+}
+
+static inline int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd->busy_queues;
+}
+
+/* Helper functions for operating on elevator idle slice timer */
+static inline int
+elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	return mod_timer(&eq->efqd->idle_slice_timer, expires);
+}
+
+static inline int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	return del_timer(&eq->efqd->idle_slice_timer);
+}
+
+static inline void
+elv_init_ioq_sched_queue(struct elevator_queue *eq, struct io_queue *ioq,
+					void *sched_queue)
+{
+	ioq->sched_queue = sched_queue;
+}
+
+static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
+{
+	return &eq->efqd->oom_ioq;
+}
+
+static inline struct io_group *
+elv_io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd->root_group;
+}
+
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
+					struct elevator_queue *e);
+extern void elv_release_fq_data(struct elv_fq_data *efqd);
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_dispatched_request_fair(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_activate_rq_fair(struct request_queue *q, struct request *rq);
+extern void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
 extern void elv_put_ioq(struct io_queue *ioq);
+extern void elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+				pid_t pid, int is_sync);
+extern void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern void *elv_io_group_async_queue_prio(struct io_group *iog,
+						int ioprio_class, int ioprio);
+extern void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+static inline struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+static inline void elv_release_fq_data(struct elv_fq_data *efqd) {}
+
+static inline int
+elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+
+static inline void
+elv_activate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_deactivate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_removed(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_add(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_ioq_completed_request(struct request_queue *q, struct request *rq) {}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 2d511f9..ea4042e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,6 +53,15 @@ static const int elv_hash_shift = 6;
 #define ELV_HASH_ENTRIES	(1 << elv_hash_shift)
 #define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
 
+static inline struct elv_fq_data *elv_efqd(struct elevator_queue *eq)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return eq->efqd;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
@@ -187,7 +196,7 @@ static struct elevator_type *elevator_get(const char *name)
 static void *elevator_init_queue(struct request_queue *q,
 				 struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	return eq->ops->elevator_init_fn(q, eq);
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
@@ -239,8 +248,21 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	eq->efqd = elv_alloc_fq_data(q, eq);
+
+	if (!eq->efqd)
+		goto err;
+
+	if (elv_init_fq_data(q, eq))
+		goto err;
+#endif
 	return eq;
 err:
+	if (elv_efqd(eq))
+		elv_release_fq_data(elv_efqd(eq));
+	if (eq->hash)
+		kfree(eq->hash);
 	kfree(eq);
 	elevator_put(e);
 	return NULL;
@@ -252,6 +274,7 @@ static void elevator_release(struct kobject *kobj)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	elevator_put(e->elevator_type);
+	elv_release_fq_data(elv_efqd(e));
 	kfree(e->hash);
 	kfree(e);
 }
@@ -309,6 +332,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
@@ -438,6 +462,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -478,6 +503,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -545,6 +571,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -651,12 +678,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -755,13 +778,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -841,8 +863,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1138,3 +1163,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return elv_ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return elv_ioq_sched_queue(elv_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..36fc210 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -65,7 +65,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct noop_data *nd;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..7cff5f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -229,6 +229,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -236,6 +241,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..4414a61 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -27,8 +27,19 @@ typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
 
-typedef void *(elevator_init_fn) (struct request_queue *);
+typedef void *(elevator_init_fn) (struct request_queue *,
+					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +67,16 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +97,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +113,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data *efqd;
+#endif
 };
 
 /*
@@ -207,5 +235,25 @@ enum {
 	__val;							\
 })
 
+/* An iosched can let the elevator know its feature set/capabilities */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible; any io scheduler using
the old interfaces will continue to work.

This is essentially a lot of CFQ logic moved into the common layer so that
other IO schedulers can make use of it in a hierarchical scheduling setup.
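
While not part of the patch itself, the following rough sketch shows how a
simple io scheduler could opt into this common fair queuing layer. The foo_*
names and the struct are made up for illustration only; the pieces the sketch
relies on (the elevator_features field, ELV_IOSCHED_NEED_FQ, the new
elevator_init_fn() signature and elv_select_sched_queue()) are the interfaces
introduced by this patch, the rest is the usual 2.6.31-era elevator API.

/* Illustrative only -- assumes the usual linux/blkdev.h, linux/elevator.h
 * and linux/module.h includes. */
struct foo_data {
	struct list_head queue;		/* FIFO of pending requests */
};

static int foo_dispatch(struct request_queue *q, int force)
{
	/* Ask the common layer which scheduler queue should be served now */
	struct foo_data *fd = elv_select_sched_queue(q, force);
	struct request *rq;

	if (!fd || list_empty(&fd->queue))
		return 0;

	rq = list_entry(fd->queue.next, struct request, queuelist);
	list_del_init(&rq->queuelist);
	elv_dispatch_sort(q, rq);
	return 1;
}

static struct elevator_type elevator_foo = {
	.ops = {
		.elevator_dispatch_fn	= foo_dispatch,
		/* other methods unchanged */
	},
	.elevator_name	= "foo",
	.elevator_owner	= THIS_MODULE,
#ifdef CONFIG_ELV_FAIR_QUEUING
	/* let the elevator layer do fair queuing on our behalf */
	.elevator_features = ELV_IOSCHED_NEED_FQ,
#endif
};

With the feature flag set, the elevator layer tracks the busy io queues,
picks the next one to serve via elv_select_ioq() and hands only the
corresponding sched_queue pointer back to the scheduler. A real conversion
would also allocate and initialize an io_queue for its requests (via
elv_alloc_ioq()/elv_init_ioq() exported above); that part is left out here.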

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    3 +-
 block/as-iosched.c       |    2 +-
 block/blk.h              |    6 +
 block/cfq-iosched.c      |    2 +-
 block/deadline-iosched.c |    3 +-
 block/elevator-fq.c      |  985 ++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h      |  229 +++++++++++
 block/elevator.c         |   63 +++-
 block/noop-iosched.c     |    2 +-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   50 +++-
 12 files changed, 1330 insertions(+), 42 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  even the simple io schedulers like noop, deadline and AS will now
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other io schedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index 19ff1e8..d545323 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
+			ioctl.o genhd.o scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..b90acbe 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1351,7 +1351,7 @@ static void as_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (as_data).
  */
-static void *as_init_queue(struct request_queue *q)
+static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct as_data *ad;
 
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..d05b4cf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,6 +1,8 @@
 #ifndef BLK_INTERNAL_H
 #define BLK_INTERNAL_H
 
+#include "elevator-fq.h"
+
 /* Amount of time in which a process may batch requests */
 #define BLK_BATCH_TIME	(HZ/50UL)
 
@@ -71,6 +73,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_activate_rq_fair(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +83,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_deactivate_rq_fair(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..5a67ec0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2448,7 +2448,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	kfree(cfqd);
 }
 
-static void *cfq_init_queue(struct request_queue *q)
+static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct cfq_data *cfqd;
 	int i;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..25af8b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -347,7 +347,8 @@ static void deadline_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static void *deadline_init_queue(struct request_queue *q)
+static void *
+deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index be7374d..1ca7b4a 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,14 +12,23 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/blktrace_api.h>
 #include "elevator-fq.h"
 
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+static struct kmem_cache *elv_ioq_pool;
+
 /*
  * offset from end of service tree
  */
 #define ELV_IDLE_DELAY		(HZ / 5)
 #define ELV_SLICE_SCALE		(500)
 #define ELV_SERVICE_SHIFT	20
+#define ELV_HW_QUEUE_MIN	(5)
+#define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
+				{ RB_ROOT, NULL, 0, NULL, 0})
 
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
@@ -98,7 +107,7 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
 
 static void update_min_vdisktime(struct io_service_tree *st)
 {
-	u64 vdisktime;
+	u64 vdisktime = st->min_vdisktime;
 
 	if (st->active_entity)
 		vdisktime = st->active_entity->vdisktime;
@@ -133,6 +142,12 @@ static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 	return ioq_of(entity)->efqd;
 }
 
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return ioq->efqd->root_group;
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
 static inline struct io_sched_data *
 io_entity_sched_data(struct io_entity *entity)
 {
@@ -238,7 +253,8 @@ static void dequeue_io_entity(struct io_entity *entity)
 }
 
 static void
-__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity,
+			int add_front)
 {
 	struct rb_node **node = &st->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -250,7 +266,8 @@ __enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
 		parent = *node;
 		entry = rb_entry(parent, struct io_entity, rb_node);
 
-		if (key < entity_key(st, entry)) {
+		if (key < entity_key(st, entry) ||
+			(add_front && (key == entity_key(st, entry)))) {
 			node = &parent->rb_left;
 		} else {
 			node = &parent->rb_right;
@@ -280,7 +297,7 @@ static void enqueue_io_entity(struct io_entity *entity)
 	sd->nr_active++;
 	entity->on_st = 1;
 	place_entity(st, entity, 0);
-	__enqueue_io_entity(st, entity);
+	__enqueue_io_entity(st, entity, 0);
 }
 
 static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
@@ -310,6 +327,7 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 			__dequeue_io_entity(st, entity);
 			st->active_entity = entity;
 			sd->active_entity = entity;
+			update_min_vdisktime(entity->st);
 			break;
 		}
 	}
@@ -317,35 +335,37 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 	return entity;
 }
 
-static void requeue_io_entity(struct io_entity *entity)
+static void requeue_io_entity(struct io_entity *entity, int add_front)
 {
 	struct io_service_tree *st = entity->st;
 	struct io_entity *next_entity;
 
-	next_entity = __lookup_next_io_entity(st);
+	if (add_front) {
+		next_entity = __lookup_next_io_entity(st);
 
-	/*
-	 * This is to emulate cfq like functionality where preemption can
-	 * happen with-in same class, like sync queue preempting async queue
-	 * May be this is not a very good idea from fairness point of view
-	 * as preempting queue gains share. Keeping it for now.
-	 *
-	 * This feature is also used by cfq close cooperator functionlity
-	 * where cfq selects a queue out of order to run next based on
-	 * close cooperator.
-	 */
+		/*
+		 * This is to emulate cfq like functionality where preemption
+		 * can happen with-in same class, like sync queue preempting
+		 * async queue.
+		 *
+		 * This feature is also used by cfq close cooperator
+		 * functionality where cfq selects a queue out of order to run
+		 * next based on close cooperator.
+		 */
 
-	if (next_entity && next_entity != entity) {
-		__dequeue_io_entity(st, entity);
-		place_entity(st, entity, 1);
-		__enqueue_io_entity(st, entity);
+		if (next_entity && next_entity == entity)
+			return;
 	}
+
+	__dequeue_io_entity(st, entity);
+	place_entity(st, entity, add_front);
+	__enqueue_io_entity(st, entity, add_front);
 }
 
-/* Requeue and ioq (already on the tree) to the front of service tree */
-static void requeue_ioq(struct io_queue *ioq)
+/* Requeue an ioq which is already on the tree */
+static void requeue_ioq(struct io_queue *ioq, int add_front)
 {
-	requeue_io_entity(&ioq->entity);
+	requeue_io_entity(&ioq->entity, add_front);
 }
 
 static void put_prev_io_entity(struct io_entity *entity)
@@ -360,7 +380,7 @@ static void put_prev_io_entity(struct io_entity *entity)
 		dequeue_io_entity(entity);
 		enqueue_io_entity(entity);
 	} else
-		__enqueue_io_entity(st, entity);
+		__enqueue_io_entity(st, entity, 0);
 }
 
 /* Put curr ioq back into rb tree. */
@@ -398,7 +418,924 @@ init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
 
 void elv_put_ioq(struct io_queue *ioq)
 {
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = efqd->eq;
+
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
 }
+EXPORT_SYMBOL(elv_put_ioq);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served, ioq->nr_sectors);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(q, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd->idle_slice_timer);
+	cancel_work_sync(&e->efqd->unplug_work);
+}
+
+static void elv_set_prio_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	ioq->slice_start = jiffies;
+	ioq->slice_end = elv_prio_to_slice(efqd, ioq) + jiffies;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->slice_end - jiffies);
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
+		int is_sync)
+{
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = eq->efqd;
+	ioq->pid = pid;
+
+	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+						int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(elv_io_group_async_queue_prio);
+
+void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_io_group_set_async_queue);
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+/*
+ * Should be called after ioq prio and class has been initialized as prio
+ * class data will be used to determine which service tree in the group
+ * entity should be attached to.
+ */
+void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog)
+{
+	init_io_entity_parent(&ioq->entity, &iog->entity);
+}
+EXPORT_SYMBOL(elv_init_ioq_io_group);
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = lookup_next_io_entity(sd);
+	if (!entity)
+		return NULL;
+
+	ioq = ioq_of(entity);
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells that io scheduler selected a queue for us
+ * and we did not select the next queue based on fairness.
+ */
+static void
+__elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+						efqd->busy_queues);
+		ioq->slice_start = ioq->slice_end = 0;
+		ioq->dispatch_start = jiffies;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq && eq->ops->elevator_active_ioq_set_fn)
+		eq->ops->elevator_active_ioq_set_fn(q, ioq->sched_queue, coop);
+}
+
+/* Get and set a new active queue for service. */
+static struct
+io_queue *elv_set_active_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	int coop = 0;
+
+	if (ioq) {
+		requeue_ioq(ioq, 1);
+		/*
+		 * io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+
+	ioq = elv_get_next_ioq(q);
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_active_ioq_reset_fn)
+		eq->ops->elevator_active_ioq_reset_fn(q, ioq->sched_queue);
+
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	enqueue_ioq(ioq);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	dequeue_ioq(ioq);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations, especially when the underlying device supports command
+ * queuing and requests from multiple queues can be in flight at the same
+ * time; then it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first and further
+ * requests to finish. For seeky queues, we will expire the queue after
+ * dispatching a few requests without waiting, and start dispatching from
+ * the next queue.
+ *
+ * Currently one should set fairness = 1 to force completion of requests
+ * from queue before dispatch from next queue starts. This should help in
+ * better time accounting at the expense of throughput.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	long slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * Queue got expired before even a single request completed or
+	 * got expired immediately after first request completion. Use
+	 * the time elapsed since queue was scheduled in.
+	 */
+	if (!ioq->slice_end || ioq->slice_start == jiffies) {
+		slice_used = jiffies - ioq->dispatch_start;
+		if (!slice_used)
+			slice_used = 1;
+		goto done;
+	}
+
+	slice_used = jiffies - ioq->slice_start;
+	if (time_after(jiffies, ioq->slice_end))
+		slice_overshoot = jiffies - ioq->slice_end;
+
+done:
+	elv_log_ioq(efqd, ioq, "disp_start = %lu sl_start= %lu sl_end=%lu,"
+			" jiffies=%lu", ioq->dispatch_start, ioq->slice_start,
+			ioq->slice_end, jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, overshoot=%ld sect=%lu",
+				slice_used, slice_overshoot, ioq->nr_sectors);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+	/* Queue is being expired. Reset number of sectors dispatched */
+	ioq->nr_sectors = 0;
+
+	put_prev_ioq(ioq);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq);
+	else if (!elv_ioq_sync(ioq)) {
+		/*
+		 * Requeue async ioq so that these will be again placed at
+		 * the end of service tree giving a chance to sync queues.
+		 */
+		requeue_ioq(ioq, 0);
+	}
+}
+EXPORT_SYMBOL(elv_ioq_slice_expired);
+
+/* Expire the ioq. */
+void elv_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no, or if we aren't sure; a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT queue's timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn) {
+		void *sched_queue = elv_ioq_sched_queue(new_ioq);
+
+		return eq->ops->elevator_should_preempt_fn(q, sched_queue, rq);
+	}
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+	elv_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	requeue_ioq(ioq, 1);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	ioq->nr_queued++;
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
+				__blk_run_queue(q);
+			else
+				elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired its mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_arm_slice_timer_fn)
+		eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+	void *sched_queue = ioq->sched_queue;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q, sched_queue);
+
+	if (new_ioq)
+		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	ioq->dispatched++;
+	ioq->nr_sectors += blk_rq_sectors(rq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_activate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+			ioq->nr_queued, efqd->rq_in_driver,
+			elv_ioq_nr_dispatched(ioq));
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_set_prio_slice(q->elevator->efqd, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = NULL;
+
+	efqd = kmalloc_node(sizeof(*efqd), GFP_KERNEL | __GFP_ZERO, q->node);
+	return efqd;
+}
+
+void elv_release_fq_data(struct elv_fq_data *efqd)
+{
+	kfree(efqd);
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+
+	/*
+	 * Our fallback ioq if elv_alloc_ioq() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	elv_init_ioq(e, &efqd->oom_ioq, 1, 0);
+	elv_get_ioq(&efqd->oom_ioq);
+	elv_init_ioq_io_group(&efqd->oom_ioq, iog);
+
+	efqd->queue = q;
+	efqd->eq = e;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 868e035..6d3809f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 #define IO_WEIGHT_DEFAULT	500
 #define IO_IOPRIO_CLASSES	3
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 struct io_service_tree {
 	struct rb_root active;
 	struct io_entity *active_entity;
@@ -61,23 +65,80 @@ struct io_queue {
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Number of sectors dispatched in current dispatch round */
+	unsigned long nr_sectors;
+
+	/* time when dispatch from the queue was started */
+	unsigned long dispatch_start;
+	/* time when first request from queue completed and slice started. */
+	unsigned long slice_start;
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
+	/*
+	 * Per-priority async queues for the RT and BE classes.
+	 * Used only by cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
 };
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	struct request_queue *queue;
+	struct elevator_queue *eq;
+	unsigned int busy_queues;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
 	/* Base slice length for sync and async queues */
 	unsigned int elv_slice[2];
+
+	/* Fallback dummy ioq for extreme OOM conditions */
+	struct io_queue oom_ioq;
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
 /* Some shared queue flag manipulation functions among elevators */
 
 enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
 };
 
@@ -95,6 +156,11 @@ static inline int elv_ioq_##name(struct io_queue *ioq)         		\
 	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
 }
 
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 
 static inline void elv_get_ioq(struct io_queue *ioq)
@@ -143,6 +209,169 @@ static inline int elv_ioq_ioprio(struct io_queue *ioq)
 	return ioq->entity.ioprio;
 }
 
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd->active_queue;
+}
+
+static inline void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return elv_ioq_sched_queue(elv_active_ioq(e));
+}
+
+static inline int elv_rq_in_driver(struct elevator_queue *e)
+{
+	return e->efqd->rq_in_driver;
+}
+
+static inline int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd->busy_queues;
+}
+
+/* Helper functions for operating on elevator idle slice timer */
+static inline int
+elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	return mod_timer(&eq->efqd->idle_slice_timer, expires);
+}
+
+static inline int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	return del_timer(&eq->efqd->idle_slice_timer);
+}
+
+static inline void
+elv_init_ioq_sched_queue(struct elevator_queue *eq, struct io_queue *ioq,
+					void *sched_queue)
+{
+	ioq->sched_queue = sched_queue;
+}
+
+static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
+{
+	return &eq->efqd->oom_ioq;
+}
+
+static inline struct io_group *
+elv_io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd->root_group;
+}
+
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
+					struct elevator_queue *e);
+extern void elv_release_fq_data(struct elv_fq_data *efqd);
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_dispatched_request_fair(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_activate_rq_fair(struct request_queue *q, struct request *rq);
+extern void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
 extern void elv_put_ioq(struct io_queue *ioq);
+extern void elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+				pid_t pid, int is_sync);
+extern void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern void *elv_io_group_async_queue_prio(struct io_group *iog,
+						int ioprio_class, int ioprio);
+extern void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+static inline struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+static inline void elv_release_fq_data(struct elv_fq_data *efqd) {}
+
+static inline int
+elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+
+static inline void
+elv_activate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_deactivate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_removed(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_add(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_ioq_completed_request(struct request_queue *q, struct request *rq) {}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 2d511f9..ea4042e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,6 +53,15 @@ static const int elv_hash_shift = 6;
 #define ELV_HASH_ENTRIES	(1 << elv_hash_shift)
 #define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
 
+static inline struct elv_fq_data *elv_efqd(struct elevator_queue *eq)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return eq->efqd;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
@@ -187,7 +196,7 @@ static struct elevator_type *elevator_get(const char *name)
 static void *elevator_init_queue(struct request_queue *q,
 				 struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	return eq->ops->elevator_init_fn(q, eq);
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
@@ -239,8 +248,21 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	eq->efqd = elv_alloc_fq_data(q, eq);
+
+	if (!eq->efqd)
+		goto err;
+
+	if (elv_init_fq_data(q, eq))
+		goto err;
+#endif
 	return eq;
 err:
+	if (elv_efqd(eq))
+		elv_release_fq_data(elv_efqd(eq));
+	if (eq->hash)
+		kfree(eq->hash);
 	kfree(eq);
 	elevator_put(e);
 	return NULL;
@@ -252,6 +274,7 @@ static void elevator_release(struct kobject *kobj)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	elevator_put(e->elevator_type);
+	elv_release_fq_data(elv_efqd(e));
 	kfree(e->hash);
 	kfree(e);
 }
@@ -309,6 +332,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
@@ -438,6 +462,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -478,6 +503,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -545,6 +571,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -651,12 +678,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -755,13 +778,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -841,8 +863,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1138,3 +1163,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return elv_ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return elv_ioq_sched_queue(elv_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..36fc210 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -65,7 +65,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct noop_data *nd;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..7cff5f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -229,6 +229,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -236,6 +241,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..4414a61 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -27,8 +27,19 @@ typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
 
-typedef void *(elevator_init_fn) (struct request_queue *);
+typedef void *(elevator_init_fn) (struct request_queue *,
+					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +67,16 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +97,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +113,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data *efqd;
+#endif
 };
 
 /*
@@ -207,5 +235,25 @@ enum {
 	__val;							\
 })
 
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* ELV_IOSCHED_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* ELV_IOSCHED_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
@ 2009-08-28 21:30   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible: any ioscheduler using
the old interfaces will continue to work.

This is essentially a lot of CFQ logic moved into the common layer so that
other IO schedulers can make use of it in a hierarchical scheduling setup.
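
As a rough illustration (not part of this patch) of how an ioscheduler is
expected to hook into this layer: the "foo" names below are made up, while
elevator_features, ELV_IOSCHED_NEED_FQ and elv_select_sched_queue() are the
interfaces introduced by this series.

	/* Opt into common fair queuing in the elevator descriptor */
	static struct elevator_type iosched_foo = {
		.elevator_name		= "foo",
		.elevator_owner		= THIS_MODULE,
	#ifdef CONFIG_ELV_FAIR_QUEUING
		.elevator_features	= ELV_IOSCHED_NEED_FQ,
	#endif
	};

	static int foo_dispatch(struct request_queue *q, int force)
	{
		/* elv_select_sched_queue() returns the sched_queue pointer
		 * of the ioq picked by the common fair queuing layer */
		void *fooq = elv_select_sched_queue(q, force);

		if (!fooq)
			return 0;

		/* ... move one request from fooq to the dispatch list ... */
		return 1;
	}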

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    3 +-
 block/as-iosched.c       |    2 +-
 block/blk.h              |    6 +
 block/cfq-iosched.c      |    2 +-
 block/deadline-iosched.c |    3 +-
 block/elevator-fq.c      |  985 ++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h      |  229 +++++++++++
 block/elevator.c         |   63 +++-
 block/noop-iosched.c     |    2 +-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   50 +++-
 12 files changed, 1330 insertions(+), 42 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had notion of multiple queues and it did
+	  fair queuing on its own. With cgroups and the need to control
+	  IO, now even the simple io schedulers like noop, deadline and as
+	  will have one queue per cgroup and will need hierarchical fair
+	  queuing. Instead of every io scheduler implementing its own fair
+	  queuing logic, this option enables fair queuing in the elevator
+	  layer so that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index 19ff1e8..d545323 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
+			ioctl.o genhd.o scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..b90acbe 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1351,7 +1351,7 @@ static void as_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (as_data).
  */
-static void *as_init_queue(struct request_queue *q)
+static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct as_data *ad;
 
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..d05b4cf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,6 +1,8 @@
 #ifndef BLK_INTERNAL_H
 #define BLK_INTERNAL_H
 
+#include "elevator-fq.h"
+
 /* Amount of time in which a process may batch requests */
 #define BLK_BATCH_TIME	(HZ/50UL)
 
@@ -71,6 +73,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_activate_rq_fair(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +83,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_deactivate_rq_fair(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..5a67ec0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2448,7 +2448,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	kfree(cfqd);
 }
 
-static void *cfq_init_queue(struct request_queue *q)
+static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct cfq_data *cfqd;
 	int i;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..25af8b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -347,7 +347,8 @@ static void deadline_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static void *deadline_init_queue(struct request_queue *q)
+static void *
+deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index be7374d..1ca7b4a 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,14 +12,23 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/blktrace_api.h>
 #include "elevator-fq.h"
 
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+static struct kmem_cache *elv_ioq_pool;
+
 /*
  * offset from end of service tree
  */
 #define ELV_IDLE_DELAY		(HZ / 5)
 #define ELV_SLICE_SCALE		(500)
 #define ELV_SERVICE_SHIFT	20
+#define ELV_HW_QUEUE_MIN	(5)
+#define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
+				{ RB_ROOT, NULL, 0, NULL, 0})
 
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
@@ -98,7 +107,7 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
 
 static void update_min_vdisktime(struct io_service_tree *st)
 {
-	u64 vdisktime;
+	u64 vdisktime = st->min_vdisktime;
 
 	if (st->active_entity)
 		vdisktime = st->active_entity->vdisktime;
@@ -133,6 +142,12 @@ static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 	return ioq_of(entity)->efqd;
 }
 
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return ioq->efqd->root_group;
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
 static inline struct io_sched_data *
 io_entity_sched_data(struct io_entity *entity)
 {
@@ -238,7 +253,8 @@ static void dequeue_io_entity(struct io_entity *entity)
 }
 
 static void
-__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity,
+			int add_front)
 {
 	struct rb_node **node = &st->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -250,7 +266,8 @@ __enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity)
 		parent = *node;
 		entry = rb_entry(parent, struct io_entity, rb_node);
 
-		if (key < entity_key(st, entry)) {
+		if (key < entity_key(st, entry) ||
+			(add_front && (key == entity_key(st, entry)))) {
 			node = &parent->rb_left;
 		} else {
 			node = &parent->rb_right;
@@ -280,7 +297,7 @@ static void enqueue_io_entity(struct io_entity *entity)
 	sd->nr_active++;
 	entity->on_st = 1;
 	place_entity(st, entity, 0);
-	__enqueue_io_entity(st, entity);
+	__enqueue_io_entity(st, entity, 0);
 }
 
 static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
@@ -310,6 +327,7 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 			__dequeue_io_entity(st, entity);
 			st->active_entity = entity;
 			sd->active_entity = entity;
+			update_min_vdisktime(entity->st);
 			break;
 		}
 	}
@@ -317,35 +335,37 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
 	return entity;
 }
 
-static void requeue_io_entity(struct io_entity *entity)
+static void requeue_io_entity(struct io_entity *entity, int add_front)
 {
 	struct io_service_tree *st = entity->st;
 	struct io_entity *next_entity;
 
-	next_entity = __lookup_next_io_entity(st);
+	if (add_front) {
+		next_entity = __lookup_next_io_entity(st);
 
-	/*
-	 * This is to emulate cfq like functionality where preemption can
-	 * happen with-in same class, like sync queue preempting async queue
-	 * May be this is not a very good idea from fairness point of view
-	 * as preempting queue gains share. Keeping it for now.
-	 *
-	 * This feature is also used by cfq close cooperator functionlity
-	 * where cfq selects a queue out of order to run next based on
-	 * close cooperator.
-	 */
+		/*
+		 * This is to emulate cfq-like functionality where preemption
+		 * can happen within the same class, like a sync queue
+		 * preempting an async queue.
+		 *
+		 * This feature is also used by the cfq close cooperator
+		 * functionality where cfq selects a queue out of order to
+		 * run next based on the close cooperator.
+		 */
 
-	if (next_entity && next_entity != entity) {
-		__dequeue_io_entity(st, entity);
-		place_entity(st, entity, 1);
-		__enqueue_io_entity(st, entity);
+		if (next_entity && next_entity == entity)
+			return;
 	}
+
+	__dequeue_io_entity(st, entity);
+	place_entity(st, entity, add_front);
+	__enqueue_io_entity(st, entity, add_front);
 }
 
-/* Requeue and ioq (already on the tree) to the front of service tree */
-static void requeue_ioq(struct io_queue *ioq)
+/* Requeue an ioq which is already on the tree */
+static void requeue_ioq(struct io_queue *ioq, int add_front)
 {
-	requeue_io_entity(&ioq->entity);
+	requeue_io_entity(&ioq->entity, add_front);
 }
 
 static void put_prev_io_entity(struct io_entity *entity)
@@ -360,7 +380,7 @@ static void put_prev_io_entity(struct io_entity *entity)
 		dequeue_io_entity(entity);
 		enqueue_io_entity(entity);
 	} else
-		__enqueue_io_entity(st, entity);
+		__enqueue_io_entity(st, entity, 0);
 }
 
 /* Put curr ioq back into rb tree. */
@@ -398,7 +418,924 @@ init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
 
 void elv_put_ioq(struct io_queue *ioq)
 {
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = efqd->eq;
+
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
 }
+EXPORT_SYMBOL(elv_put_ioq);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served, ioq->nr_sectors);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(q, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd->idle_slice_timer);
+	cancel_work_sync(&e->efqd->unplug_work);
+}
+
+static void elv_set_prio_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	ioq->slice_start = jiffies;
+	ioq->slice_end = elv_prio_to_slice(efqd, ioq) + jiffies;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->slice_end - jiffies);
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
+		int is_sync)
+{
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = eq->efqd;
+	ioq->pid = pid;
+
+	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+						int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(elv_io_group_async_queue_prio);
+
+void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_io_group_set_async_queue);
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+/*
+ * Should be called after ioq prio and class has been initialized as prio
+ * class data will be used to determine which service tree in the group
+ * entity should be attached to.
+ */
+void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog)
+{
+	init_io_entity_parent(&ioq->entity, &iog->entity);
+}
+EXPORT_SYMBOL(elv_init_ioq_io_group);
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = lookup_next_io_entity(sd);
+	if (!entity)
+		return NULL;
+
+	ioq = ioq_of(entity);
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells us that the io scheduler selected a queue
+ * for us and that we did not select the next queue based on fairness.
+ */
+static void
+__elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+						efqd->busy_queues);
+		ioq->slice_start = ioq->slice_end = 0;
+		ioq->dispatch_start = jiffies;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq && eq->ops->elevator_active_ioq_set_fn)
+		eq->ops->elevator_active_ioq_set_fn(q, ioq->sched_queue, coop);
+}
+
+/* Get and set a new active queue for service. */
+static struct
+io_queue *elv_set_active_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	int coop = 0;
+
+	if (ioq) {
+		requeue_ioq(ioq, 1);
+		/*
+		 * io scheduler selected the next queue for us. Pass this
+		 * this info back to io scheudler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+
+	ioq = elv_get_next_ioq(q);
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_active_ioq_reset_fn)
+		eq->ops->elevator_active_ioq_reset_fn(q, ioq->sched_queue);
+
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	enqueue_ioq(ioq);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	dequeue_ioq(ioq);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start/finish time of the queue and
+ * the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations, especially when the underlying device supports command
+ * queuing and requests from multiple queues can be in flight at the same
+ * time; then it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from that queue has completed. This does not
+ * work very well if we expire the queue before the first (and further)
+ * requests from the queue have finished. For seeky queues, we will expire
+ * the queue after dispatching a few requests without waiting and start
+ * dispatching from the next queue.
+ *
+ * Currently one should set fairness = 1 to force completion of requests
+ * from a queue before dispatch from the next queue starts. This should
+ * result in better time accounting at the expense of throughput.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	long slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * Queue got expired before even a single request completed or
+	 * got expired immediately after first request completion. Use
+	 * the time elapsed since queue was scheduled in.
+	 */
+	if (!ioq->slice_end || ioq->slice_start == jiffies) {
+		slice_used = jiffies - ioq->dispatch_start;
+		if (!slice_used)
+			slice_used = 1;
+		goto done;
+	}
+
+	slice_used = jiffies - ioq->slice_start;
+	if (time_after(jiffies, ioq->slice_end))
+		slice_overshoot = jiffies - ioq->slice_end;
+
+done:
+	elv_log_ioq(efqd, ioq, "disp_start = %lu sl_start= %lu sl_end=%lu,"
+			" jiffies=%lu", ioq->dispatch_start, ioq->slice_start,
+			ioq->slice_end, jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, overshoot=%ld sect=%lu",
+				slice_used, slice_overshoot, ioq->nr_sectors);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+	/* Queue is being expired. Reset number of sectors dispatched */
+	ioq->nr_sectors = 0;
+
+	put_prev_ioq(ioq);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq);
+	else if (!elv_ioq_sync(ioq)) {
+		/*
+		 * Requeue async ioq so that these will be again placed at
+		 * the end of service tree giving a chance to sync queues.
+		 */
+		requeue_ioq(ioq, 0);
+	}
+}
+EXPORT_SYMBOL(elv_ioq_slice_expired);
+
+/* Expire the ioq. */
+void elv_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_cfqq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn) {
+		void *sched_queue = elv_ioq_sched_queue(new_ioq);
+
+		return eq->ops->elevator_should_preempt_fn(q, sched_queue, rq);
+	}
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+	elv_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	requeue_ioq(ioq, 1);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	ioq->nr_queued++;
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
+				__blk_run_queue(q);
+			else
+				elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired its mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elevator_queue *eq = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(eq);
+
+	if (eq->ops->elevator_arm_slice_timer_fn)
+		eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+	void *sched_queue = ioq->sched_queue;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q, sched_queue);
+
+	if (new_ioq)
+		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	ioq->dispatched++;
+	ioq->nr_sectors += blk_rq_sectors(rq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_activate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+			ioq->nr_queued, efqd->rq_in_driver,
+			elv_ioq_nr_dispatched(ioq));
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_set_prio_slice(q->elevator->efqd, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = NULL;
+
+	efqd = kmalloc_node(sizeof(*efqd), GFP_KERNEL | __GFP_ZERO, q->node);
+	return efqd;
+}
+
+void elv_release_fq_data(struct elv_fq_data *efqd)
+{
+	kfree(efqd);
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+
+	/*
+	 * Our fallback ioq if elv_alloc_ioq() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	elv_init_ioq(e, &efqd->oom_ioq, 1, 0);
+	elv_get_ioq(&efqd->oom_ioq);
+	elv_init_ioq_io_group(&efqd->oom_ioq, iog);
+
+	efqd->queue = q;
+	efqd->eq = e;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 868e035..6d3809f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 #define IO_WEIGHT_DEFAULT	500
 #define IO_IOPRIO_CLASSES	3
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 struct io_service_tree {
 	struct rb_root active;
 	struct io_entity *active_entity;
@@ -61,23 +65,80 @@ struct io_queue {
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Number of sectors dispatched in current dispatch round */
+	unsigned long nr_sectors;
+
+	/* time when dispatch from the queue was started */
+	unsigned long dispatch_start;
+	/* time when first request from queue completed and slice started. */
+	unsigned long slice_start;
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
 };
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	struct request_queue *queue;
+	struct elevator_queue *eq;
+	unsigned int busy_queues;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
 	/* Base slice length for sync and async queues */
 	unsigned int elv_slice[2];
+
+	/* Fallback dummy ioq for extreme OOM conditions */
+	struct io_queue oom_ioq;
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
 /* Some shared queue flag manipulation functions among elevators */
 
 enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
 };
 
@@ -95,6 +156,11 @@ static inline int elv_ioq_##name(struct io_queue *ioq)         		\
 	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
 }
 
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 
 static inline void elv_get_ioq(struct io_queue *ioq)
@@ -143,6 +209,169 @@ static inline int elv_ioq_ioprio(struct io_queue *ioq)
 	return ioq->entity.ioprio;
 }
 
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many request are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many request are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd->active_queue;
+}
+
+static inline void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return elv_ioq_sched_queue(elv_active_ioq(e));
+}
+
+static inline int elv_rq_in_driver(struct elevator_queue *e)
+{
+	return e->efqd->rq_in_driver;
+}
+
+static inline int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd->busy_queues;
+}
+
+/* Helper functions for operating on elevator idle slice timer */
+static inline int
+elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	return mod_timer(&eq->efqd->idle_slice_timer, expires);
+}
+
+static inline int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	return del_timer(&eq->efqd->idle_slice_timer);
+}
+
+static inline void
+elv_init_ioq_sched_queue(struct elevator_queue *eq, struct io_queue *ioq,
+					void *sched_queue)
+{
+	ioq->sched_queue = sched_queue;
+}
+
+static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
+{
+	return &eq->efqd->oom_ioq;
+}
+
+static inline struct io_group *
+elv_io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd->root_group;
+}
+
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
+					struct elevator_queue *e);
+extern void elv_release_fq_data(struct elv_fq_data *efqd);
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_dispatched_request_fair(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_activate_rq_fair(struct request_queue *q, struct request *rq);
+extern void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
 extern void elv_put_ioq(struct io_queue *ioq);
+extern void elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+				pid_t pid, int is_sync);
+extern void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern void *elv_io_group_async_queue_prio(struct io_group *iog,
+						int ioprio_class, int ioprio);
+extern void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+static inline struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+static inline void elv_release_fq_data(struct elv_fq_data *efqd) {}
+
+static inline int
+elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+
+static inline void
+elv_activate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_deactivate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_removed(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_add(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_ioq_completed_request(struct request_queue *q, struct request *rq) {}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 2d511f9..ea4042e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,6 +53,15 @@ static const int elv_hash_shift = 6;
 #define ELV_HASH_ENTRIES	(1 << elv_hash_shift)
 #define rq_hash_key(rq)		(blk_rq_pos(rq) + blk_rq_sectors(rq))
 
+static inline struct elv_fq_data *elv_efqd(struct elevator_queue *eq)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return eq->efqd;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
@@ -187,7 +196,7 @@ static struct elevator_type *elevator_get(const char *name)
 static void *elevator_init_queue(struct request_queue *q,
 				 struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	return eq->ops->elevator_init_fn(q, eq);
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
@@ -239,8 +248,21 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	eq->efqd = elv_alloc_fq_data(q, eq);
+
+	if (!eq->efqd)
+		goto err;
+
+	if (elv_init_fq_data(q, eq))
+		goto err;
+#endif
 	return eq;
 err:
+	if (elv_efqd(eq))
+		elv_release_fq_data(elv_efqd(eq));
+	if (eq->hash)
+		kfree(eq->hash);
 	kfree(eq);
 	elevator_put(e);
 	return NULL;
@@ -252,6 +274,7 @@ static void elevator_release(struct kobject *kobj)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	elevator_put(e->elevator_type);
+	elv_release_fq_data(elv_efqd(e));
 	kfree(e->hash);
 	kfree(e);
 }
@@ -309,6 +332,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
@@ -438,6 +462,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -478,6 +503,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_dispatched_request_fair(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -545,6 +571,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -651,12 +678,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -755,13 +778,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -841,8 +863,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1138,3 +1163,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return elv_ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return elv_ioq_sched_queue(elv_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..36fc210 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -65,7 +65,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct noop_data *nd;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..7cff5f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -229,6 +229,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -236,6 +241,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..4414a61 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -27,8 +27,19 @@ typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
 
-typedef void *(elevator_init_fn) (struct request_queue *);
+typedef void *(elevator_init_fn) (struct request_queue *,
+					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +67,16 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +97,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +113,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data *efqd;
+#endif
 };
 
 /*
@@ -207,5 +235,25 @@ enum {
 	__val;							\
 })
 
+/* an iosched can let the elevator know its feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-08-28 21:30   ` [PATCH 03/23] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 05/23] io-controller: Core scheduler changes to support hierarhical scheduling Vivek Goyal
                     ` (26 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes cfq to use fair queuing code from elevator layer.
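
For orientation, a minimal sketch of what an I/O scheduler provides in order to
opt into the elevator fair queuing layer, mirroring what this patch does for
cfq. The ELV_IOSCHED_NEED_FQ flag and the hook names are the ones introduced by
this series; the "foo" scheduler and its foo_* callbacks are hypothetical
placeholders, not real code:

    /* hypothetical iosched; only the fair-queuing-related fields are shown */
    static struct elevator_type iosched_foo = {
    	.ops = {
    		/* usual hooks (merge, dispatch, add_req, ...) omitted */
    		.elevator_init_fn		= foo_init_queue,
    		.elevator_exit_fn		= foo_exit_queue,
    		/* callbacks invoked by the elevator fair queuing core */
    		.elevator_free_sched_queue_fn	= foo_free_queue,
    		.elevator_active_ioq_set_fn	= foo_active_ioq_set,
    		.elevator_active_ioq_reset_fn	= foo_active_ioq_reset,
    		.elevator_arm_slice_timer_fn	= foo_arm_slice_timer,
    		.elevator_should_preempt_fn	= foo_should_preempt,
    		.elevator_close_cooperator_fn	= foo_close_cooperator,
    	},
    	/* ask the elevator layer to do the fair queuing bookkeeping */
    	.elevator_features	= ELV_IOSCHED_NEED_FQ,
    	.elevator_name		= "foo",
    	.elevator_owner		= THIS_MODULE,
    };

With that in place, the active queue tracking, idle slice timer and busy queue
accounting that cfq used to do by hand are handled by the common elevator fair
queuing code, which is what allows the large deletions below.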

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   |  980 +++++++++++--------------------------------------
 2 files changed, 217 insertions(+), 766 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5a67ec0..4bde1c8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "elevator-fq.h"
 
 /*
  * tunables
@@ -23,17 +24,10 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
 /*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
-
-/*
  * below this threshold, we consider thinktime immediate
  */
 #define CFQ_MIN_TT		(2)
@@ -43,7 +37,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (elv_ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +47,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -74,16 +66,11 @@ struct cfq_rb_root {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -99,18 +86,13 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio, org_ioprio_class;
 
 	pid_t pid;
 };
@@ -120,12 +102,6 @@ struct cfq_queue {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -133,14 +109,6 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
 	/*
@@ -151,21 +119,8 @@ struct cfq_data {
 	int hw_tag_samples;
 	int rq_in_driver_peak;
 
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -175,7 +130,6 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
@@ -188,16 +142,10 @@ struct cfq_data {
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -215,16 +163,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -263,66 +205,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
-}
-
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
-{
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -421,33 +324,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -474,95 +350,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -624,57 +411,43 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
+
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	/*
+	 * If queue was selected because it was a close cooperator, then
+	 * mark it so that it is not selected again and again. Otherwise
+	 * clear the coop flag so that it becomes eligible to get selected
+	 * again.
+	 */
+	if (coop)
+		cfq_mark_cfqq_coop(cfqq);
+	else
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -683,7 +456,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -691,8 +463,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from
+	 * the prio trees. For the last request, nr_queued will still be 1,
+	 * as the elevator fair queuing layer has not yet done its accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -710,9 +491,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -720,7 +498,9 @@ static void cfq_add_rq_rb(struct request *rq)
 	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
 
 	/*
-	 * adjust priority tree position, if ->next_rq changes
+	 * adjust priority tree position, if ->next_rq changes. This also
+	 * takes care of adding a new queue to the prio tree: if this is the
+	 * first request, prev will be NULL while cfqq->next_rq will not.
 	 */
 	if (prev != cfqq->next_rq)
 		cfq_prio_tree_add(cfqd, cfqq);
@@ -760,23 +540,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -861,93 +627,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1024,11 +718,11 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
-					      int probe)
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1049,14 +743,13 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
-	if (!probe)
-		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
@@ -1069,18 +762,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 		return;
 
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
+	if (!cfqd->cfq_slice_idle || !elv_ioq_idle_window(cfqq->ioq))
 		return;
 
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (elv_rq_in_driver(q->elevator))
 		return;
 
 	/*
@@ -1090,7 +783,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
+	elv_mark_ioq_wait_request(cfqq->ioq);
 
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
@@ -1101,7 +794,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1113,10 +806,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1154,78 +846,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1250,12 +875,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired. */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1301,13 +928,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1324,7 +948,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1334,13 +958,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1349,51 +973,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1481,9 +1099,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1571,7 +1189,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1584,30 +1202,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1649,19 +1270,17 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			  pid_t pid, int is_sync)
 {
-	RB_CLEAR_NODE(&cfqq->rb_node);
 	RB_CLEAR_NODE(&cfqq->p_node);
 	INIT_LIST_HEAD(&cfqq->fifo);
 
-	atomic_set(&cfqq->ref, 0);
 	cfqq->cfqd = cfqd;
 
 	cfq_mark_cfqq_prio_changed(cfqq);
 
 	if (is_sync) {
 		if (!cfq_class_idle(cfqq))
-			cfq_mark_cfqq_idle_window(cfqq);
-		cfq_mark_cfqq_sync(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
+		elv_mark_ioq_sync(cfqq->ioq);
 	}
 	cfqq->pid = pid;
 }
@@ -1672,8 +1291,13 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 
 retry:
+	iog = elv_io_get_io_group(q, 0);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -1683,8 +1307,29 @@ retry:
 	 * originally, since it should just be a temporary situation.
 	 */
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
+		/* Allocate ioq object first and then cfqq */
+		if (new_ioq) {
+			goto alloc_cfqq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(cfqd->queue->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			spin_lock_irq(cfqd->queue->queue_lock);
+			if (new_ioq)
+				goto retry;
+		} else
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+
+alloc_cfqq:
+		if (!ioq && !new_ioq) {
+			/* ioq allocation failed. Default to oom_cfqq */
+			cfqq = &cfqd->oom_cfqq;
+			goto out;
+		}
+
 		cfqq = NULL;
 		if (new_cfqq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
 		} else if (gfp_mask & __GFP_WAIT) {
@@ -1702,60 +1347,59 @@ retry:
 		}
 
 		if (cfqq) {
+			elv_init_ioq(q->elevator, ioq, current->pid, is_sync);
+			elv_init_ioq_sched_queue(q->elevator, ioq, cfqq);
+
+			cfqq->ioq = ioq;
 			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
 			cfq_init_prio_data(cfqq, ioc);
+
+			/* call it after cfq has initialized queue prio */
+			elv_init_ioq_io_group(ioq, iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
-		} else
+		} else {
 			cfqq = &cfqd->oom_cfqq;
+			/* If ioq allocation was successful, free it up */
+			if (ioq)
+				elv_free_ioq(ioq);
+		}
 	}
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+out:
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	      gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq)
 		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
+							cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1960,7 +1604,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
+	enable_idle = old_idle = elv_ioq_idle_window(cfqq->ioq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
 	    (cfqd->hw_tag && CIC_SEEKY(cic)))
@@ -1975,9 +1619,9 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (old_idle != enable_idle) {
 		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
 		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
 		else
-			cfq_clear_cfqq_idle_window(cfqq);
+			elv_clear_ioq_idle_window(cfqq->ioq);
 	}
 }
 
@@ -1986,16 +1630,15 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  * no or if we aren't sure, a 1 will cause a preempt.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2018,13 +1661,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2038,27 +1675,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
  */
@@ -2077,36 +1693,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2130,11 +1716,13 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	struct elevator_queue *eq = cfqd->queue->elevator;
+
+	if (elv_rq_in_driver(eq) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = elv_rq_in_driver(eq);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    elv_rq_in_driver(eq) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,44 +1749,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2207,29 +1761,32 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
+	if ((elv_ioq_wait_request(cfqq->ioq) || cfq_cfqq_must_alloc(cfqq)) &&
 	    !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
@@ -2282,7 +1839,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2318,119 +1875,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2439,12 +1908,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2457,8 +1921,6 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,25 +1935,20 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	 * will not attempt to free it.
 	 */
 	cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
-	atomic_inc(&cfqd->oom_cfqq.ref);
+
+	/* Link up oom_ioq and oom_cfqq */
+	cfqd->oom_cfqq.ioq = elv_get_oom_ioq(eq);
+	elv_init_ioq_sched_queue(eq, elv_get_oom_ioq(eq), &cfqd->oom_cfqq);
 
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
@@ -2560,8 +2017,6 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2590,8 +2045,6 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2605,10 +2058,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2621,8 +2074,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2632,7 +2083,14 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2640,14 +2098,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This patch changes cfq to use fair queuing code from elevator layer.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   |  980 +++++++++++--------------------------------------
 2 files changed, 217 insertions(+), 766 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5a67ec0..4bde1c8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "elevator-fq.h"
 
 /*
  * tunables
@@ -23,17 +24,10 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
 /*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
-
-/*
  * below this threshold, we consider thinktime immediate
  */
 #define CFQ_MIN_TT		(2)
@@ -43,7 +37,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (elv_ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +47,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -74,16 +66,11 @@ struct cfq_rb_root {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -99,18 +86,13 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio, org_ioprio_class;
 
 	pid_t pid;
 };
@@ -120,12 +102,6 @@ struct cfq_queue {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -133,14 +109,6 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
 	/*
@@ -151,21 +119,8 @@ struct cfq_data {
 	int hw_tag_samples;
 	int rq_in_driver_peak;
 
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -175,7 +130,6 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
@@ -188,16 +142,10 @@ struct cfq_data {
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -215,16 +163,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -263,66 +205,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
-}
-
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
-{
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -421,33 +324,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -474,95 +350,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -624,57 +411,43 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
+
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	/*
+	 * If queue was selected because it was a close cooperator, then
+	 * mark it so that it is not selected again and again. Otherwise
+	 * clear the coop flag so that it becomes eligible to get selected
+	 * again.
+	 */
+	if (coop)
+		cfq_mark_cfqq_coop(cfqq);
+	else
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -683,7 +456,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -691,8 +463,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from
+	 * the prio trees. For the last request, nr_queued will still be 1,
+	 * as the elevator fair queuing layer has not yet done the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -710,9 +491,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -720,7 +498,9 @@ static void cfq_add_rq_rb(struct request *rq)
 	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
 
 	/*
-	 * adjust priority tree position, if ->next_rq changes
+	 * adjust priority tree position, if ->next_rq changes. This also
+	 * takes care of adding a new queue to the prio tree: for the first
+	 * request, prev is NULL while cfqq->next_rq is not.
 	 */
 	if (prev != cfqq->next_rq)
 		cfq_prio_tree_add(cfqd, cfqq);
@@ -760,23 +540,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -861,93 +627,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1024,11 +718,11 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
-					      int probe)
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1049,14 +743,13 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
-	if (!probe)
-		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
@@ -1069,18 +762,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 		return;
 
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
+	if (!cfqd->cfq_slice_idle || !elv_ioq_idle_window(cfqq->ioq))
 		return;
 
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (elv_rq_in_driver(q->elevator))
 		return;
 
 	/*
@@ -1090,7 +783,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
+	elv_mark_ioq_wait_request(cfqq->ioq);
 
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
@@ -1101,7 +794,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1113,10 +806,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1154,78 +846,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1250,12 +875,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now; the above loop should make sure
+	 * that all the busy queues have expired */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1301,13 +928,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1324,7 +948,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1334,13 +958,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1349,51 +973,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1481,9 +1099,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1571,7 +1189,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1584,30 +1202,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1649,19 +1270,17 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			  pid_t pid, int is_sync)
 {
-	RB_CLEAR_NODE(&cfqq->rb_node);
 	RB_CLEAR_NODE(&cfqq->p_node);
 	INIT_LIST_HEAD(&cfqq->fifo);
 
-	atomic_set(&cfqq->ref, 0);
 	cfqq->cfqd = cfqd;
 
 	cfq_mark_cfqq_prio_changed(cfqq);
 
 	if (is_sync) {
 		if (!cfq_class_idle(cfqq))
-			cfq_mark_cfqq_idle_window(cfqq);
-		cfq_mark_cfqq_sync(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
+		elv_mark_ioq_sync(cfqq->ioq);
 	}
 	cfqq->pid = pid;
 }
@@ -1672,8 +1291,13 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 
 retry:
+	iog = elv_io_get_io_group(q, 0);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -1683,8 +1307,29 @@ retry:
 	 * originally, since it should just be a temporary situation.
 	 */
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
+		/* Allocate ioq object first and then cfqq */
+		if (new_ioq) {
+			goto alloc_cfqq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(cfqd->queue->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			spin_lock_irq(cfqd->queue->queue_lock);
+			if (new_ioq)
+				goto retry;
+		} else
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+
+alloc_cfqq:
+		if (!ioq && !new_ioq) {
+			/* ioq allocation failed. Default to oom_cfqq */
+			cfqq = &cfqd->oom_cfqq;
+			goto out;
+		}
+
 		cfqq = NULL;
 		if (new_cfqq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
 		} else if (gfp_mask & __GFP_WAIT) {
@@ -1702,60 +1347,59 @@ retry:
 		}
 
 		if (cfqq) {
+			elv_init_ioq(q->elevator, ioq, current->pid, is_sync);
+			elv_init_ioq_sched_queue(q->elevator, ioq, cfqq);
+
+			cfqq->ioq = ioq;
 			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
 			cfq_init_prio_data(cfqq, ioc);
+
+			/* call it after cfq has initialized queue prio */
+			elv_init_ioq_io_group(ioq, iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
-		} else
+		} else {
 			cfqq = &cfqd->oom_cfqq;
+			/* If ioq allocation was successful, free it up */
+			if (ioq)
+				elv_free_ioq(ioq);
+		}
 	}
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+out:
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	      gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq)
 		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
+							cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1960,7 +1604,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
+	enable_idle = old_idle = elv_ioq_idle_window(cfqq->ioq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
 	    (cfqd->hw_tag && CIC_SEEKY(cic)))
@@ -1975,9 +1619,9 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (old_idle != enable_idle) {
 		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
 		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
 		else
-			cfq_clear_cfqq_idle_window(cfqq);
+			elv_clear_ioq_idle_window(cfqq->ioq);
 	}
 }
 
@@ -1986,16 +1630,15 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  * no or if we aren't sure, a 1 will cause a preempt.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2018,13 +1661,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2038,27 +1675,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
  */
@@ -2077,36 +1693,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2130,11 +1716,13 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	struct elevator_queue *eq = cfqd->queue->elevator;
+
+	if (elv_rq_in_driver(eq) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = elv_rq_in_driver(eq);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    elv_rq_in_driver(eq) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,44 +1749,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2207,29 +1761,32 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
+	if ((elv_ioq_wait_request(cfqq->ioq) || cfq_cfqq_must_alloc(cfqq)) &&
 	    !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
@@ -2282,7 +1839,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2318,119 +1875,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2439,12 +1908,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2457,8 +1921,6 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,25 +1935,20 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	 * will not attempt to free it.
 	 */
 	cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
-	atomic_inc(&cfqd->oom_cfqq.ref);
+
+	/* Link up oom_ioq and oom_cfqq */
+	cfqd->oom_cfqq.ioq = elv_get_oom_ioq(eq);
+	elv_init_ioq_sched_queue(eq, elv_get_oom_ioq(eq), &cfqd->oom_cfqq);
 
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
@@ -2560,8 +2017,6 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2590,8 +2045,6 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2605,10 +2058,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2621,8 +2074,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2632,7 +2083,14 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2640,14 +2098,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
@ 2009-08-28 21:30   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

This patch changes CFQ to use the fair queuing code from the elevator
layer: the service tree, active queue tracking, idle timer and slice
expiry logic move out of cfq-iosched.c and are handled by the common
elevator fair queuing code instead.
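
As an illustrative sketch (not code from this patch), the dispatch path
after the conversion roughly looks like the snippet below. The
elv_select_sched_queue(), elv_ioq_nr_dispatched() and elv_nr_busy_ioq()
helpers are the ones added by the elevator fair queuing patches earlier
in this series; the function name and the logic are simplified from the
real cfq_dispatch_requests().

/* sketch only: queue selection is delegated to the fair queuing layer */
static int sketch_dispatch_requests(struct request_queue *q, int force)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq;

	/* the elevator layer picks the next io_queue; cfq only sees its
	 * own sched_queue (the cfq_queue) attached to it */
	cfqq = elv_select_sched_queue(q, force);
	if (!cfqq)
		return 0;

	/* per-queue dispatch state now lives in the embedded io_queue */
	if (elv_ioq_nr_dispatched(cfqq->ioq) >= cfqd->cfq_quantum &&
	    elv_nr_busy_ioq(q->elevator) > 1)
		return 0;

	/* cfq still decides which request from the queue goes next */
	cfq_dispatch_request(cfqd, cfqq);
	return 1;
}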

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   |  980 +++++++++++--------------------------------------
 2 files changed, 217 insertions(+), 766 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5a67ec0..4bde1c8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "elevator-fq.h"
 
 /*
  * tunables
@@ -23,17 +24,10 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
 /*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
-
-/*
  * below this threshold, we consider thinktime immediate
  */
 #define CFQ_MIN_TT		(2)
@@ -43,7 +37,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (elv_ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +47,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -74,16 +66,11 @@ struct cfq_rb_root {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -99,18 +86,13 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio, org_ioprio_class;
 
 	pid_t pid;
 };
@@ -120,12 +102,6 @@ struct cfq_queue {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -133,14 +109,6 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
 	/*
@@ -151,21 +119,8 @@ struct cfq_data {
 	int hw_tag_samples;
 	int rq_in_driver_peak;
 
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -175,7 +130,6 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
@@ -188,16 +142,10 @@ struct cfq_data {
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -215,16 +163,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -263,66 +205,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
-}
-
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
-{
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -421,33 +324,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -474,95 +350,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -624,57 +411,43 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
+
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	/*
+	 * If queue was selected because it was a close cooperator, then
+	 * mark it so that it is not selected again and again. Otherwise
+	 * clear the coop flag so that it becomes eligible to get selected
+	 * again.
+	 */
+	if (coop)
+		cfq_mark_cfqq_coop(cfqq);
+	else
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -683,7 +456,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -691,8 +463,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from
+	 * the prio trees. For the last request, nr_queued is still 1 because
+	 * the elevator fair queuing layer has not yet done the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -710,9 +491,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -720,7 +498,9 @@ static void cfq_add_rq_rb(struct request *rq)
 	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
 
 	/*
-	 * adjust priority tree position, if ->next_rq changes
+	 * adjust priority tree position, if ->next_rq changes. This also
+	 * takes care of adding a new queue to the prio tree: for the first
+	 * request, prev is NULL while cfqq->next_rq is not.
 	 */
 	if (prev != cfqq->next_rq)
 		cfq_prio_tree_add(cfqd, cfqq);
@@ -760,23 +540,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -861,93 +627,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1024,11 +718,11 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
-					      int probe)
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1049,14 +743,13 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
-	if (!probe)
-		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
@@ -1069,18 +762,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 		return;
 
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
+	if (!cfqd->cfq_slice_idle || !elv_ioq_idle_window(cfqq->ioq))
 		return;
 
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (elv_rq_in_driver(q->elevator))
 		return;
 
 	/*
@@ -1090,7 +783,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
+	elv_mark_ioq_wait_request(cfqq->ioq);
 
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
@@ -1101,7 +794,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1113,10 +806,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1154,78 +846,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1250,12 +875,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now; the above loop should make sure
+	 * that all the busy queues have expired. */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1301,13 +928,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1324,7 +948,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1334,13 +958,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1349,51 +973,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1481,9 +1099,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1571,7 +1189,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1584,30 +1202,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1649,19 +1270,17 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			  pid_t pid, int is_sync)
 {
-	RB_CLEAR_NODE(&cfqq->rb_node);
 	RB_CLEAR_NODE(&cfqq->p_node);
 	INIT_LIST_HEAD(&cfqq->fifo);
 
-	atomic_set(&cfqq->ref, 0);
 	cfqq->cfqd = cfqd;
 
 	cfq_mark_cfqq_prio_changed(cfqq);
 
 	if (is_sync) {
 		if (!cfq_class_idle(cfqq))
-			cfq_mark_cfqq_idle_window(cfqq);
-		cfq_mark_cfqq_sync(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
+		elv_mark_ioq_sync(cfqq->ioq);
 	}
 	cfqq->pid = pid;
 }
@@ -1672,8 +1291,13 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 
 retry:
+	iog = elv_io_get_io_group(q, 0);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -1683,8 +1307,29 @@ retry:
 	 * originally, since it should just be a temporary situation.
 	 */
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
+		/* Allocate ioq object first and then cfqq */
+		if (new_ioq) {
+			goto alloc_cfqq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(cfqd->queue->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			spin_lock_irq(cfqd->queue->queue_lock);
+			if (new_ioq)
+				goto retry;
+		} else
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+
+alloc_cfqq:
+		if (!ioq && !new_ioq) {
+			/* ioq allocation failed. Default to oom_cfqq */
+			cfqq = &cfqd->oom_cfqq;
+			goto out;
+		}
+
 		cfqq = NULL;
 		if (new_cfqq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
 		} else if (gfp_mask & __GFP_WAIT) {
@@ -1702,60 +1347,59 @@ retry:
 		}
 
 		if (cfqq) {
+			elv_init_ioq(q->elevator, ioq, current->pid, is_sync);
+			elv_init_ioq_sched_queue(q->elevator, ioq, cfqq);
+
+			cfqq->ioq = ioq;
 			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
 			cfq_init_prio_data(cfqq, ioc);
+
+			/* call it after cfq has initialized queue prio */
+			elv_init_ioq_io_group(ioq, iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
-		} else
+		} else {
 			cfqq = &cfqd->oom_cfqq;
+			/* If ioq allocation was successful, free it up */
+			if (ioq)
+				elv_free_ioq(ioq);
+		}
 	}
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+out:
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	      gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq)
 		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
+							cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1960,7 +1604,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
+	enable_idle = old_idle = elv_ioq_idle_window(cfqq->ioq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
 	    (cfqd->hw_tag && CIC_SEEKY(cic)))
@@ -1975,9 +1619,9 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (old_idle != enable_idle) {
 		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
 		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
+			elv_mark_ioq_idle_window(cfqq->ioq);
 		else
-			cfq_clear_cfqq_idle_window(cfqq);
+			elv_clear_ioq_idle_window(cfqq->ioq);
 	}
 }
 
@@ -1986,16 +1630,15 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  * no or if we aren't sure, a 1 will cause a preempt.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2018,13 +1661,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2038,27 +1675,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
  */
@@ -2077,36 +1693,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2130,11 +1716,13 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	struct elevator_queue *eq = cfqd->queue->elevator;
+
+	if (elv_rq_in_driver(eq) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = elv_rq_in_driver(eq);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    elv_rq_in_driver(eq) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,44 +1749,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2207,29 +1761,32 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
+	if ((elv_ioq_wait_request(cfqq->ioq) || cfq_cfqq_must_alloc(cfqq)) &&
 	    !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
@@ -2282,7 +1839,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2318,119 +1875,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2439,12 +1908,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2457,8 +1921,6 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,25 +1935,20 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	 * will not attempt to free it.
 	 */
 	cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
-	atomic_inc(&cfqd->oom_cfqq.ref);
+
+	/* Link up oom_ioq and oom_cfqq */
+	cfqd->oom_cfqq.ioq = elv_get_oom_ioq(eq);
+	elv_init_ioq_sched_queue(eq, elv_get_oom_ioq(eq), &cfqd->oom_cfqq);
 
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
@@ -2560,8 +2017,6 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2590,8 +2045,6 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2605,10 +2058,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2621,8 +2074,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2632,7 +2083,14 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2640,14 +2098,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.6


* [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-08-28 21:30   ` [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 06/23] io-controller: cgroup related changes for hierarchical group support Vivek Goyal
                     ` (25 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o This patch introduces core changes in the fair queuing scheduler to support
  hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
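
For illustration only (this sketch is not part of the patch): the core idea
behind the for_each_entity() changes below is that service done by a queue is
charged to every ancestor group entity, so fairness can be enforced at each
level of the hierarchy. A minimal standalone model of that accounting, with
made-up names and a simplified vdisktime formula, could look like this:

#include <stdio.h>

/* simplified stand-ins for io_entity/io_group; not the kernel types */
struct entity {
	const char *name;
	unsigned long vdisktime;	/* weighted virtual time consumed */
	unsigned long weight;
	struct entity *parent;		/* NULL for the root group */
};

/*
 * Charge 'served' jiffies to the entity and all of its ancestors. Like
 * the patch's for_each_entity(), the walk stops before the parentless
 * root entity, which is never accounted.
 */
static void entity_served(struct entity *e, unsigned long served)
{
	for (; e && e->parent; e = e->parent)
		e->vdisktime += served * 1000 / e->weight;
}

int main(void)
{
	struct entity root  = { "root",  0, 1000, NULL  };
	struct entity grp   = { "grp",   0,  500, &root };
	struct entity queue = { "queue", 0,  250, &grp  };

	entity_served(&queue, 8);	/* the queue used 8 jiffies of disk time */

	/* prints: queue=32 grp=16 root=0 */
	printf("%s=%lu %s=%lu %s=%lu\n", queue.name, queue.vdisktime,
	       grp.name, grp.vdisktime, root.name, root.vdisktime);
	return 0;
}

In this model a larger weight means a smaller vdisktime increase per unit of
service, i.e. a bigger share of the disk at that level.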

Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  158 ++++++++++++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h |   19 ++++++
 init/Kconfig        |    8 +++
 3 files changed, 177 insertions(+), 8 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1ca7b4a..6546df0 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -137,6 +137,88 @@ static inline struct io_group *iog_of(struct io_entity *entity)
 	return NULL;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity)	\
+	for (; entity && entity->parent; entity = entity->parent)
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (parent_entity(entity) == parent_entity(new_entity))
+		return 1;
+
+	return 0;
+}
+
+/* return the depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * The preemption test can only be made between sibling entities that
+	 * are in the same group, i.e. entities that have a common parent.
+	 * Walk up the hierarchy of both entities until we find ancestors
+	 * that are siblings under a common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return iog_of(parent_entity(&ioq->entity));
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	return &iog_of(parent_entity(entity))->sched_data;
+}
+
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity) { }
+
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	return 1;
+}
+
 static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 {
 	return ioq_of(entity)->efqd;
@@ -155,6 +237,7 @@ io_entity_sched_data(struct io_entity *entity)
 
 	return &efqd->root_group->sched_data;
 }
+#endif /* GROUP_IOSCHED */
 
 static inline void
 init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
@@ -171,8 +254,10 @@ static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
-	entity->vdisktime += elv_delta_fair(served, entity);
-	update_min_vdisktime(entity->st);
+	for_each_entity(entity) {
+		entity->vdisktime += elv_delta_fair(served, entity);
+		update_min_vdisktime(entity->st);
+	}
 }
 
 static void place_entity(struct io_service_tree *st, struct io_entity *entity,
@@ -388,14 +473,23 @@ static void put_prev_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	put_prev_io_entity(entity);
+	for_each_entity(entity) {
+		put_prev_io_entity(entity);
+	}
 }
 
 static void dequeue_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	dequeue_io_entity(entity);
+	for_each_entity(entity) {
+		struct io_sched_data *sd = io_entity_sched_data(entity);
+
+		dequeue_io_entity(entity);
+		/* Don't dequeue parent if it has other entities besides us */
+		if (sd->nr_active)
+			break;
+	}
 	elv_put_ioq(ioq);
 	return;
 }
@@ -406,7 +500,12 @@ static void enqueue_ioq(struct io_queue *ioq)
 	struct io_entity *entity = &ioq->entity;
 
 	elv_get_ioq(ioq);
-	enqueue_io_entity(entity);
+
+	for_each_entity(entity) {
+		if (entity->on_st)
+			break;
+		enqueue_io_entity(entity);
+	}
 }
 
 static inline void
@@ -638,6 +737,38 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -666,6 +797,8 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 /*
  * Should be called after ioq prio and class has been initialized as prio
  * class data will be used to determine which service tree in the group
@@ -691,9 +824,11 @@ static struct io_queue *elv_get_next_ioq(struct request_queue *q)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	entity = lookup_next_io_entity(sd);
-	if (!entity)
-		return NULL;
+	for (; sd != NULL; sd = entity->my_sd) {
+		entity = lookup_next_io_entity(sd);
+		if (!entity)
+			return NULL;
+	}
 
 	ioq = ioq_of(entity);
 	return ioq;
@@ -894,6 +1029,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In a hierarchical setup, one needs to traverse up the hierarchy
+	 * until both queues are children of the same parent before making
+	 * the preemption decision.
+	 */
+	find_matching_io_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6d3809f..776f429 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -86,6 +86,23 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+struct io_group {
+	struct io_entity entity;
+	atomic_t ref;
+	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
+};
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
@@ -99,6 +116,8 @@ struct io_group {
 	void *key;
 };
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..29f701d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,6 +612,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.6


* [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o This patch introduces core changes in the fair queuing scheduler to support
  hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
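
For illustration only (this sketch is not part of the patch): the preemption
change below compares two queues only after walking both of their entities up
to siblings under a common parent. A small standalone model of that
depth-equalizing walk, using simplified types and hypothetical names, is:

#include <stdio.h>

/* simplified stand-in for io_entity; not the kernel type */
struct entity {
	const char *name;
	struct entity *parent;		/* NULL for the root group */
};

static int depth(struct entity *e)
{
	int d = 0;

	for (; e && e->parent; e = e->parent)
		d++;
	return d;
}

/* walk both entities up until they are siblings under the same parent */
static void find_matching(struct entity **a, struct entity **b)
{
	int da = depth(*a), db = depth(*b);

	while (da > db) { *a = (*a)->parent; da--; }
	while (db > da) { *b = (*b)->parent; db--; }
	while ((*a)->parent != (*b)->parent) {
		*a = (*a)->parent;
		*b = (*b)->parent;
	}
}

int main(void)
{
	struct entity root = { "root", NULL };
	struct entity g1 = { "g1", &root }, g2 = { "g2", &root };
	struct entity g11 = { "g11", &g1 };
	struct entity q1 = { "q1", &g11 }, q2 = { "q2", &g2 };
	struct entity *a = &q1, *b = &q2;

	find_matching(&a, &b);
	printf("compare %s vs %s\n", a->name, b->name);	/* "compare g1 vs g2" */
	return 0;
}

With q1 nested one level deeper than q2, the walk ends up comparing the
top-level groups g1 and g2, the only level at which a fair comparison of the
two makes sense.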

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  158 ++++++++++++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h |   19 ++++++
 init/Kconfig        |    8 +++
 3 files changed, 177 insertions(+), 8 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1ca7b4a..6546df0 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -137,6 +137,88 @@ static inline struct io_group *iog_of(struct io_entity *entity)
 	return NULL;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity)	\
+	for (; entity && entity->parent; entity = entity->parent)
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (parent_entity(entity) == parent_entity(new_entity))
+		return 1;
+
+	return 0;
+}
+
+/* return the depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * The preemption test can only be made between sibling entities that
+	 * are in the same group, i.e. entities that have a common parent.
+	 * Walk up the hierarchy of both entities until we find ancestors
+	 * that are siblings under a common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return iog_of(parent_entity(&ioq->entity));
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	return &iog_of(parent_entity(entity))->sched_data;
+}
+
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity) { }
+
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	return 1;
+}
+
 static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 {
 	return ioq_of(entity)->efqd;
@@ -155,6 +237,7 @@ io_entity_sched_data(struct io_entity *entity)
 
 	return &efqd->root_group->sched_data;
 }
+#endif /* GROUP_IOSCHED */
 
 static inline void
 init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
@@ -171,8 +254,10 @@ static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
-	entity->vdisktime += elv_delta_fair(served, entity);
-	update_min_vdisktime(entity->st);
+	for_each_entity(entity) {
+		entity->vdisktime += elv_delta_fair(served, entity);
+		update_min_vdisktime(entity->st);
+	}
 }
 
 static void place_entity(struct io_service_tree *st, struct io_entity *entity,
@@ -388,14 +473,23 @@ static void put_prev_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	put_prev_io_entity(entity);
+	for_each_entity(entity) {
+		put_prev_io_entity(entity);
+	}
 }
 
 static void dequeue_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	dequeue_io_entity(entity);
+	for_each_entity(entity) {
+		struct io_sched_data *sd = io_entity_sched_data(entity);
+
+		dequeue_io_entity(entity);
+		/* Don't dequeue parent if it has other entities besides us */
+		if (sd->nr_active)
+			break;
+	}
 	elv_put_ioq(ioq);
 	return;
 }
@@ -406,7 +500,12 @@ static void enqueue_ioq(struct io_queue *ioq)
 	struct io_entity *entity = &ioq->entity;
 
 	elv_get_ioq(ioq);
-	enqueue_io_entity(entity);
+
+	for_each_entity(entity) {
+		if (entity->on_st)
+			break;
+		enqueue_io_entity(entity);
+	}
 }
 
 static inline void
@@ -638,6 +737,38 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -666,6 +797,8 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 /*
  * Should be called after ioq prio and class has been initialized as prio
  * class data will be used to determine which service tree in the group
@@ -691,9 +824,11 @@ static struct io_queue *elv_get_next_ioq(struct request_queue *q)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	entity = lookup_next_io_entity(sd);
-	if (!entity)
-		return NULL;
+	for (; sd != NULL; sd = entity->my_sd) {
+		entity = lookup_next_io_entity(sd);
+		if (!entity)
+			return NULL;
+	}
 
 	ioq = ioq_of(entity);
 	return ioq;
@@ -894,6 +1029,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In a hierarchical setup, one needs to traverse up the hierarchy
+	 * until both queues are children of the same parent before making
+	 * the preemption decision.
+	 */
+	find_matching_io_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6d3809f..776f429 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -86,6 +86,23 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+struct io_group {
+	struct io_entity entity;
+	atomic_t ref;
+	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
+};
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
@@ -99,6 +116,8 @@ struct io_group {
 	void *key;
 };
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..29f701d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,6 +612,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.6



* [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
@ 2009-08-28 21:30   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o This patch introduces core changes in the fair queuing scheduler to support
  hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
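
For illustration only (this sketch is not part of the patch): on dequeue the
code below walks up the hierarchy but stops as soon as an ancestor still has
other busy children, so a group leaves the service tree only when it becomes
completely idle. A tiny standalone model of that rule, with hypothetical
names, is:

#include <stdio.h>

/* simplified model: each node counts its children on the service tree */
struct node {
	const char *name;
	int nr_active;
	struct node *parent;		/* NULL for the root group */
};

/* a queue goes idle; ancestors are dequeued only if they become empty */
static void dequeue(struct node *n)
{
	while (n->parent) {
		n->parent->nr_active--;
		/* the parent still has other busy children: it stays queued */
		if (n->parent->nr_active)
			break;
		n = n->parent;
	}
}

int main(void)
{
	struct node root = { "root", 1, NULL };		/* one active child: grp */
	struct node grp  = { "grp",  2, &root };	/* two active queues */
	struct node q1   = { "q1",   0, &grp };

	dequeue(&q1);
	/* prints: grp active=1 root active=1 (grp stays on the tree) */
	printf("grp active=%d root active=%d\n", grp.nr_active, root.nr_active);
	return 0;
}

Since grp still has one busy queue after q1 goes idle, neither grp nor root
is dequeued.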

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  158 ++++++++++++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h |   19 ++++++
 init/Kconfig        |    8 +++
 3 files changed, 177 insertions(+), 8 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1ca7b4a..6546df0 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -137,6 +137,88 @@ static inline struct io_group *iog_of(struct io_entity *entity)
 	return NULL;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity)	\
+	for (; entity && entity->parent; entity = entity->parent)
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (parent_entity(entity) == parent_entity(new_entity))
+		return 1;
+
+	return 0;
+}
+
+/* return the depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same group i.e who have a common parent. Walk up the hierarchy of
+	 * both entities until we find their ancestors who are siblings of
+	 * common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return iog_of(parent_entity(&ioq->entity));
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+	return &iog_of(parent_entity(entity))->sched_data;
+}
+
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+static void find_matching_io_entity(struct io_entity **entity,
+			struct io_entity **new_entity) { }
+
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	return 1;
+}
+
 static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
 {
 	return ioq_of(entity)->efqd;
@@ -155,6 +237,7 @@ io_entity_sched_data(struct io_entity *entity)
 
 	return &efqd->root_group->sched_data;
 }
+#endif /* GROUP_IOSCHED */
 
 static inline void
 init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
@@ -171,8 +254,10 @@ static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
-	entity->vdisktime += elv_delta_fair(served, entity);
-	update_min_vdisktime(entity->st);
+	for_each_entity(entity) {
+		entity->vdisktime += elv_delta_fair(served, entity);
+		update_min_vdisktime(entity->st);
+	}
 }
 
 static void place_entity(struct io_service_tree *st, struct io_entity *entity,
@@ -388,14 +473,23 @@ static void put_prev_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	put_prev_io_entity(entity);
+	for_each_entity(entity) {
+		put_prev_io_entity(entity);
+	}
 }
 
 static void dequeue_ioq(struct io_queue *ioq)
 {
 	struct io_entity *entity = &ioq->entity;
 
-	dequeue_io_entity(entity);
+	for_each_entity(entity) {
+		struct io_sched_data *sd = io_entity_sched_data(entity);
+
+		dequeue_io_entity(entity);
+		/* Don't dequeue parent if it has other entities besides us */
+		if (sd->nr_active)
+			break;
+	}
 	elv_put_ioq(ioq);
 	return;
 }
@@ -406,7 +500,12 @@ static void enqueue_ioq(struct io_queue *ioq)
 	struct io_entity *entity = &ioq->entity;
 
 	elv_get_ioq(ioq);
-	enqueue_io_entity(entity);
+
+	for_each_entity(entity) {
+		if (entity->on_st)
+			break;
+		enqueue_io_entity(entity);
+	}
 }
 
 static inline void
@@ -638,6 +737,38 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd->root_group;
+
+	put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	iog->entity.my_sd = &iog->sched_data;
+	iog->key = key;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -666,6 +797,8 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 /*
  * Should be called after ioq prio and class has been initialized as prio
  * class data will be used to determine which service tree in the group
@@ -691,9 +824,11 @@ static struct io_queue *elv_get_next_ioq(struct request_queue *q)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	entity = lookup_next_io_entity(sd);
-	if (!entity)
-		return NULL;
+	for (; sd != NULL; sd = entity->my_sd) {
+		entity = lookup_next_io_entity(sd);
+		if (!entity)
+			return NULL;
+	}
 
 	ioq = ioq_of(entity);
 	return ioq;
@@ -894,6 +1029,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In hierarchical setup, one need to traverse up the hierarchy
+	 * till both the queues are children of same parent to make a
+	 * decision whether to do the preemption or not.
+	 */
+	find_matching_io_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6d3809f..776f429 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -86,6 +86,23 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+struct io_group {
+	struct io_entity entity;
+	atomic_t ref;
+	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+	void *key;
+};
+
+#else /* CONFIG_GROUP_IOSCHED */
+
 struct io_group {
 	struct io_entity entity;
 	struct io_sched_data sched_data;
@@ -99,6 +116,8 @@ struct io_group {
 	void *key;
 };
 
+#endif /* CONFIG_GROUP_IOSCHED */
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..29f701d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,6 +612,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 06/23] io-controller: cgroup related changes for hierarchical group support
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o This patch introduces some of the cgroup-related code for the io controller.
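
  The per-cgroup files added here (weight and ioprio_class) get their
  handlers from the SHOW_FUNCTION/STORE_FUNCTION macros in the diff below.
  As a reading aid only (hand-expanded, not extra code in the patch, with
  comments added by hand), the generated store handler for the weight file
  is roughly:

static int io_cgroup_weight_write(struct cgroup *cgroup, struct cftype *cftype,
				  u64 val)
{
	struct io_cgroup *iocg;
	struct io_group *iog;
	struct hlist_node *n;

	/* reject weights outside the allowed range */
	if (val < IO_WEIGHT_MIN || val > IO_WEIGHT_MAX)
		return -EINVAL;

	if (!cgroup_lock_live_group(cgroup))
		return -ENODEV;

	iocg = cgroup_to_io_cgroup(cgroup);

	spin_lock_irq(&iocg->lock);
	iocg->weight = (unsigned long)val;
	/* propagate the new weight to every io_group of this cgroup */
	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
		iog->entity.weight = (unsigned long)val;
		smp_wmb();
		iog->entity.ioprio_changed = 1;
	}
	spin_unlock_irq(&iocg->lock);

	cgroup_unlock();

	return 0;
}

  So a write updates both the per-cgroup value and every io_group already
  instantiated for that cgroup, and sets ioprio_changed so the elevator
  notices the change later.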

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           |  167 +++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h           |   14 ++++
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/iocontext.h     |    5 +
 5 files changed, 195 insertions(+), 0 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index d4ed600..0d56336 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 6546df0..d0f341e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -739,6 +739,173 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
 
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_WEIGHT_DEFAULT,
+	.ioprio_class = IOPRIO_CLASS_BE,
+};
+
+static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, IO_WEIGHT_MIN, IO_WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+struct cftype io_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, io_files, ARRAY_SIZE(io_files));
+}
+
+static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_WEIGHT_DEFAULT;
+	iocg->ioprio_class = IOPRIO_CLASS_BE;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures.  By now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it has still no ioc the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+
+	/* Implemented in later patch */
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+	.use_id = 1,
+};
+
 static void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd->root_group;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 776f429..f92afac 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -13,6 +13,7 @@
 
 #ifdef CONFIG_BLOCK
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _ELV_SCHED_H
 #define _ELV_SCHED_H
@@ -91,6 +92,8 @@ struct io_group {
 	struct io_entity entity;
 	atomic_t ref;
 	struct io_sched_data sched_data;
+	struct hlist_node group_node;
+	unsigned short iocg_id;
 	/*
 	 * async queue for each priority case for RT and BE class.
 	 * Used only for cfq.
@@ -101,6 +104,17 @@ struct io_group {
 	void *key;
 };
 
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned int weight;
+	unsigned short ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+
+
 #else /* CONFIG_GROUP_IOSCHED */
 
 struct io_group {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..baf544f 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 4da4a75..b343594 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If task changes the cgroup, elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevator layer
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o This patch enables hierarchical fair queuing in the common layer. It is
  controlled by the config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on the ioq, and the ioq keeps a reference
  on its group. For async queues in CFQ, and for the single ioq in other
  schedulers, the io_group also keeps a reference on the io_queue. This
  reference on the ioq is dropped when the queue is released
  (elv_release_ioq), so the queue can be freed.

  When a queue is released, it puts the reference to io_group and the
  io_group is released after all the queues are released. Child groups
  also take reference on parent groups, and release it when they are
  destroyed.

o Reads of iocg->group_data are not always done under iocg->lock, so all the
  operations on that list are still protected by RCU. All modifications to
  iocg->group_data should always be done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be held first. This avoids all deadlocks. In order to avoid race
  between cgroup deletion and elevator switch the following algorithm is
  used:

	- Cgroup deletion path holds iocg->lock and removes the iog entry
	  from the iocg->group_data list. Then it drops iocg->lock, holds
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from
	  iocg->group_data list directly. It first holds iocg->lock,
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for removing iog.

  So the path which removes iog from iocg->group_data list does
  the final removal of iog by calling __io_destroy_group()
  function.
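
  As a rough illustration (not part of the patch), the reference scheme
  described above can be modelled by the small standalone userspace program
  below. The ref/parent fields mirror struct io_group, but the helper names
  (iog_alloc/iog_get/iog_put) and the scenario in main() are made up for the
  example; the real code uses atomic_t, elv_get_iog()/elv_put_iog(), and
  defers the final kfree() through call_rcu().

/* Toy model of the io_group reference counting described above. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct io_group {
	atomic_int ref;
	struct io_group *parent;
	const char *name;
};

static struct io_group *iog_alloc(const char *name, struct io_group *parent)
{
	struct io_group *iog = calloc(1, sizeof(*iog));

	if (!iog)
		abort();
	iog->name = name;
	iog->parent = parent;
	atomic_store(&iog->ref, 1);		/* "joint" creation reference */
	if (parent)
		atomic_fetch_add(&parent->ref, 1);	/* child pins its parent */
	return iog;
}

static void iog_get(struct io_group *iog)
{
	atomic_fetch_add(&iog->ref, 1);
}

static void iog_put(struct io_group *iog)
{
	struct io_group *parent = iog->parent;

	if (atomic_fetch_sub(&iog->ref, 1) != 1)
		return;				/* references still outstanding */
	printf("freeing group %s\n", iog->name);
	free(iog);				/* the real code defers this via call_rcu() */
	if (parent)
		iog_put(parent);		/* drop the child's reference on its parent */
}

int main(void)
{
	struct io_group *root = iog_alloc("root", NULL);
	struct io_group *child = iog_alloc("child", root);

	iog_get(child);		/* an ioq takes a reference on its group */
	iog_put(child);		/* the ioq is released and drops it again */
	iog_put(child);		/* creation ref: cgroup deletion or elevator exit */
	iog_put(root);		/* root's creation ref; the root goes last */
	return 0;
}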

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  479 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   35 ++++
 block/elevator.c    |    4 +
 4 files changed, 509 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4bde1c8..decb654 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1356,6 +1356,8 @@ alloc_cfqq:
 
 			/* call it after cfq has initialized queue prio */
 			elv_init_ioq_io_group(ioq, iog);
+			/* ioq reference on iog */
+			elv_get_iog(iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
 		} else {
 			cfqq = &cfqd->oom_cfqq;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index d0f341e..8e40b64 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -519,6 +519,7 @@ void elv_put_ioq(struct io_queue *ioq)
 {
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = efqd->eq;
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
@@ -526,12 +527,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(elv_ioq_busy(ioq));
 	BUG_ON(efqd->active_queue == ioq);
+	iog = ioq_to_io_group(ioq);
 
 	/* Can be called by outgoing elevator. Don't use q */
 	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
@@ -738,6 +741,27 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = iocg->weight;
+	entity->ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sd = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity = &iog->entity;
+
+	init_io_entity_parent(entity, &parent->entity);
+
+	/* Child group reference on parent group. */
+	elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_WEIGHT_DEFAULT,
@@ -750,6 +774,27 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search for the io_group of efqd in the hash table (for now only a list)
+ * of the io_cgroup.  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -889,12 +934,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -906,24 +945,210 @@ struct cgroup_subsys io_subsys = {
 	.use_id = 1,
 };
 
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+
+		atomic_set(&iog->ref, 0);
+
+		/*
+		 * Take the initial reference that will be released on destroy.
+		 * This can be thought of as a joint reference by the cgroup and
+		 * the elevator, which will be dropped by either the elevator exit
+		 * or the cgroup deletion path, depending on who exits first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the key field
+			 * (which will later hold efqd); it is still unused and
+			 * will be initialized only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup, struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	/*
+	 * This connects the topmost element of the allocated chain to the
+	 * parent group.
+	 */
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to the css object. We don't want to map a bio to
+	 * a group if it has been marked for deletion.
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(elv_io_get_io_group);
+
 static void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd->root_group;
+	struct io_cgroup *iocg = &io_root_cgroup;
+
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
 
 	put_io_group_queues(e, iog);
-	kfree(iog);
+	elv_put_iog(iog);
 }
 
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
 	struct io_group *iog;
+	struct io_cgroup *iocg = &io_root_cgroup;
 	int i;
 
 	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (iog == NULL)
 		return NULL;
 
+	elv_get_iog(iog);
 	iog->entity.parent = NULL;
 	iog->entity.my_sd = &iog->sched_data;
 	iog->key = key;
@@ -931,11 +1156,215 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
 	return iog;
 }
 
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->sched_data.active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(st->active_entity != NULL);
+	}
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when elv_io_get_io_group() is called without queue
+	 * lock to access some group data from bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent_iog = NULL;
+	struct io_entity *parent;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	parent = parent_entity(&iog->entity);
+	if (parent)
+		parent_iog = iog_of(parent);
+
+	io_group_cleanup(iog);
+
+	if (parent_iog)
+		elv_put_iog(parent_iog);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	hlist_del(&iog->elv_data_node);
+	put_io_group_queues(efqd->eq, iog);
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, group can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator also might be
+	 * exiting, and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+	 * we can't hold the queue lock while holding iocg->lock. So we first
+	 * remove iog from iocg->group_data under iocg->lock. Whoever removes
+	 * iog from iocg->group_data should call __io_destroy_group to remove
+	 * iog.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void
+io_group_check_and_destroy(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void release_elv_io_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * If the bio submitting task and the rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq, rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+static inline void release_elv_io_groups(struct elevator_queue *e) {}
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -1012,8 +1441,13 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 	struct elevator_queue *eq = q->elevator;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-						efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d class=%hu prio=%hu"
+				" weight=%u group_weight=%u qued=%d",
+				efqd->busy_queues, ioq->entity.ioprio_class,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog), ioq->nr_queued);
+
 		ioq->slice_start = ioq->slice_end = 0;
 		ioq->dispatch_start = jiffies;
 
@@ -1186,6 +1620,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1218,9 +1653,16 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn) {
 		void *sched_queue = elv_ioq_sched_queue(new_ioq);
 
@@ -1365,6 +1807,10 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 	if (new_ioq)
 		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
 
+	/* Only select co-operating queue if it belongs to same group as ioq */
+	if (new_ioq && !is_same_group(&ioq->entity, &new_ioq->entity))
+		return NULL;
+
 	return new_ioq;
 }
 
@@ -1607,6 +2053,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -1624,12 +2071,22 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	release_elv_io_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f92afac..56d0bfc 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -93,6 +93,7 @@ struct io_group {
 	atomic_t ref;
 	struct io_sched_data sched_data;
 	struct hlist_node group_node;
+	struct hlist_node elv_data_node;
 	unsigned short iocg_id;
 	/*
 	 * async queue for each priority case for RT and BE class.
@@ -102,6 +103,7 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 	void *key;
+	struct rcu_head rcu_head;
 };
 
 struct io_cgroup {
@@ -135,6 +137,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	struct elevator_queue *eq;
 	unsigned int busy_queues;
@@ -315,6 +320,28 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
 	return &eq->efqd->oom_ioq;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+extern struct io_group *elv_io_get_io_group(struct request_queue *q,
+						int create);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+#else /* !GROUP_IOSCHED */
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog) {}
+static inline void elv_put_iog(struct io_group *iog) {}
+
 static inline struct io_group *
 elv_io_get_io_group(struct request_queue *q, int create)
 {
@@ -322,6 +349,8 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -405,6 +434,12 @@ static inline void *elv_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index ea4042e..b2725cd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -122,6 +122,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !bio_failfast_driver(bio)	 != !blk_failfast_driver(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!elv_io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 08/23] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Make cfq hierarchical.
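
In short, the point of the hierarchy glue added here is that a task's cached
cfq queues must be dropped when the task moves to another cgroup, so that its
next request allocates queues in the new group. The standalone userspace
sketch below models just that idea; it is not CFQ code, and the helpers
(alloc_queue(), handle_cgroup_change()) plus the iog field on the toy
cfq_queue are invented for the example. The real logic lives in
changed_cgroup()/cfq_ioc_set_cgroup() in the diff below.

#include <stdio.h>
#include <stdlib.h>

struct io_group {
	int id;
};

struct cfq_queue {
	struct io_group *iog;		/* group this queue was allocated in */
};

struct io_context {
	struct cfq_queue *sync_cfqq;	/* cached per-task sync queue */
	int cgroup_changed;		/* set when the task changes cgroups */
};

static struct cfq_queue *alloc_queue(struct io_group *iog)
{
	struct cfq_queue *cfqq = malloc(sizeof(*cfqq));

	if (!cfqq)
		abort();
	cfqq->iog = iog;
	return cfqq;
}

/* Rough analogue of changed_cgroup()/cfq_ioc_set_cgroup() below. */
static void handle_cgroup_change(struct io_context *ioc, struct io_group *cur)
{
	if (!ioc->cgroup_changed)
		return;
	if (ioc->sync_cfqq && ioc->sync_cfqq->iog != cur) {
		free(ioc->sync_cfqq);	/* CFQ drops a reference instead of freeing */
		ioc->sync_cfqq = NULL;	/* re-allocated in the new group on next IO */
	}
	ioc->cgroup_changed = 0;
}

int main(void)
{
	struct io_group grp_a = { .id = 1 }, grp_b = { .id = 2 };
	struct io_context ioc = { .sync_cfqq = alloc_queue(&grp_a) };

	ioc.cgroup_changed = 1;			/* the task moved into grp_b */
	handle_cgroup_change(&ioc, &grp_b);
	if (!ioc.sync_cfqq)			/* next request: queue in new group */
		ioc.sync_cfqq = alloc_queue(&grp_b);
	printf("sync queue now belongs to group %d\n", ioc.sync_cfqq->iog->id);
	free(ioc.sync_cfqq);
	return 0;
}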

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    8 ++++++
 block/cfq-iosched.c   |   63 ++++++++++++++++++++++++++++++++++++++++++++++--
 init/Kconfig          |    2 +-
 3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
 	  working environment, suitable for desktop systems.
 	  This is the default I/O scheduler.
 
+config IOSCHED_CFQ_HIER
+	bool "CFQ Hierarchical Scheduling support"
+	depends on IOSCHED_CFQ && CGROUPS
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in cfq.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index decb654..6c1f87a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1285,6 +1285,60 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfqq->pid = pid;
 }
 
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_data *cfqd = cic->key;
+	struct io_group *iog, *__iog;
+	unsigned long flags;
+	struct request_queue *q;
+
+	if (unlikely(!cfqd))
+		return;
+
+	q = cfqd->queue;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	iog = elv_io_get_io_group(q, 0);
+
+	if (async_cfqq != NULL) {
+		__iog = cfqq_to_io_group(async_cfqq);
+		if (iog != __iog) {
+			/* cgroup changed, drop the reference to async queue */
+			cic_set_cfqq(cic, NULL, 0);
+			cfq_put_queue(async_cfqq);
+		}
+	}
+
+	if (sync_cfqq != NULL) {
+		__iog = cfqq_to_io_group(sync_cfqq);
+
+		/*
+		 * Drop the reference to the sync queue. A new sync queue will
+		 * be assigned in the new group upon arrival of a fresh request.
+		 * If the old queue still has requests, those requests will be
+		 * dispatched over a period of time and the queue will be freed
+		 * automatically.
+		 */
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
+	}
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_IOSCHED_CFQ_HIER */
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
@@ -1296,7 +1350,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_group *iog = NULL;
 
 retry:
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group(q, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1385,7 +1439,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);
+	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 1);
 
 	if (!is_sync) {
 		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
@@ -1540,7 +1594,10 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+	if (unlikely(ioc->cgroup_changed))
+		cfq_ioc_set_cgroup(ioc);
+#endif
 	return cic;
 err_free:
 	cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index 29f701d..afcaa86 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -613,7 +613,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
 config GROUP_IOSCHED
-	bool "Group IO Scheduler"
+	bool
 	depends on CGROUPS && ELV_FAIR_QUEUING
 	default n
 	---help---
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 09/23] io-controller: Export disk time used and nr sectors dipatched through cgroups
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-08-28 21:30   ` [PATCH 08/23] io-controller: cfq changes to use " Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  2009-08-28 21:30   ` [PATCH 10/23] io-controller: Debug hierarchical IO scheduling Vivek Goyal
                     ` (21 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o This patch exports some statistics through cgroup interface. Two of the
  statistics currently exported are actual disk time assigned to the cgroup
  and actual number of sectors dispatched to disk on behalf of this cgroup.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |   10 +++++++
 2 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 8e40b64..563db2d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -13,6 +13,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
+#include <linux/seq_file.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -257,6 +258,8 @@ entity_served(struct io_entity *entity, unsigned long served,
 	for_each_entity(entity) {
 		entity->vdisktime += elv_delta_fair(served, entity);
 		update_min_vdisktime(entity->st);
+		entity->total_time += served;
+		entity->total_sectors += nr_sectors;
 	}
 }
 
@@ -854,6 +857,66 @@ STORE_FUNCTION(weight, IO_WEIGHT_MIN, IO_WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_time);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_sectors);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
 struct cftype io_files[] = {
 	{
 		.name = "weight",
@@ -865,6 +928,14 @@ struct cftype io_files[] = {
 		.read_u64 = io_cgroup_ioprio_class_read,
 		.write_u64 = io_cgroup_ioprio_class_write,
 	},
+	{
+		.name = "disk_time",
+		.read_seq_string = io_cgroup_disk_time_read,
+	},
+	{
+		.name = "disk_sectors",
+		.read_seq_string = io_cgroup_disk_sectors_read,
+	},
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -956,6 +1027,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
 	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+	unsigned int major, minor;
+	struct backing_dev_info *bdi = &q->backing_dev_info;
 
 	for (; cgroup != NULL; cgroup = cgroup->parent) {
 		iocg = cgroup_to_io_cgroup(cgroup);
@@ -976,6 +1049,9 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 
 		iog->iocg_id = css_id(&iocg->css);
 
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		iog->dev = MKDEV(major, minor);
+
 		io_group_init_entity(iocg, iog);
 
 		atomic_set(&iog->ref, 0);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 56d0bfc..9757e39 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -53,6 +53,13 @@ struct io_entity {
 
 	unsigned short ioprio, ioprio_class;
 	int ioprio_changed;
+
+	/*
+	 * Keep track of total service received by this entity. Keep the
+	 * stats both for time slices and number of sectors dispatched
+	 */
+	unsigned long total_time;
+	unsigned long total_sectors;
 };
 
 /*
@@ -104,6 +111,9 @@ struct io_group {
 	struct io_queue *async_idle_queue;
 	void *key;
 	struct rcu_head rcu_head;
+
+	/* The device MKDEV(major, minor), this group has been created for */
+	dev_t	dev;
 };
 
 struct io_cgroup {
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 09/23] io-controller: Export disk time used and nr sectors dispatched through cgroups
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o This patch exports some statistics through the cgroup interface. Two of the
  statistics currently exported are the actual disk time assigned to the cgroup
  and the actual number of sectors dispatched to disk on behalf of this cgroup.
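
o As an illustration (not part of this patch), the new files can be read like
  any other cgroup file. The sketch below is a minimal userspace example; it
  assumes the io controller hierarchy is mounted at /cgroup with a child group
  named test1, and prints one "major:minor value" line per block device for
  which the group has been created.

	#include <stdio.h>

	int main(void)
	{
		char line[128];
		/* Example path; depends on where the io cgroup hierarchy is mounted */
		FILE *f = fopen("/cgroup/test1/io.disk_time", "r");

		if (!f) {
			perror("io.disk_time");
			return 1;
		}

		/* Print the per-device "MAJOR:MINOR disk_time" pairs as-is */
		while (fgets(line, sizeof(line), f))
			printf("%s", line);

		fclose(f);
		return 0;
	}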

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |   10 +++++++
 2 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 8e40b64..563db2d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -13,6 +13,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
+#include <linux/seq_file.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -257,6 +258,8 @@ entity_served(struct io_entity *entity, unsigned long served,
 	for_each_entity(entity) {
 		entity->vdisktime += elv_delta_fair(served, entity);
 		update_min_vdisktime(entity->st);
+		entity->total_time += served;
+		entity->total_sectors += nr_sectors;
 	}
 }
 
@@ -854,6 +857,66 @@ STORE_FUNCTION(weight, IO_WEIGHT_MIN, IO_WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_time);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_sectors);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
 struct cftype io_files[] = {
 	{
 		.name = "weight",
@@ -865,6 +928,14 @@ struct cftype io_files[] = {
 		.read_u64 = io_cgroup_ioprio_class_read,
 		.write_u64 = io_cgroup_ioprio_class_write,
 	},
+	{
+		.name = "disk_time",
+		.read_seq_string = io_cgroup_disk_time_read,
+	},
+	{
+		.name = "disk_sectors",
+		.read_seq_string = io_cgroup_disk_sectors_read,
+	},
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -956,6 +1027,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
 	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+	unsigned int major, minor;
+	struct backing_dev_info *bdi = &q->backing_dev_info;
 
 	for (; cgroup != NULL; cgroup = cgroup->parent) {
 		iocg = cgroup_to_io_cgroup(cgroup);
@@ -976,6 +1049,9 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 
 		iog->iocg_id = css_id(&iocg->css);
 
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		iog->dev = MKDEV(major, minor);
+
 		io_group_init_entity(iocg, iog);
 
 		atomic_set(&iog->ref, 0);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 56d0bfc..9757e39 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -53,6 +53,13 @@ struct io_entity {
 
 	unsigned short ioprio, ioprio_class;
 	int ioprio_changed;
+
+	/*
+	 * Keep track of total service received by this entity. Keep the
+	 * stats both for time slices and number of sectors dispatched
+	 */
+	unsigned long total_time;
+	unsigned long total_sectors;
 };
 
 /*
@@ -104,6 +111,9 @@ struct io_group {
 	struct io_queue *async_idle_queue;
 	void *key;
 	struct rcu_head rcu_head;
+
+	/* The device MKDEV(major, minor), this group has been created for */
+	dev_t	dev;
 };
 
 struct io_cgroup {
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 10/23] io-controller: Debug hierarchical IO scheduling
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:30   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o A little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it outputs more debug messages in the blktrace output, which helps
  a great deal when debugging a hierarchical setup. It also creates additional
  cgroup interfaces, io.disk_queue and io.disk_dequeue, to expose some more
  debugging data.
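
o As an illustration (not part of this patch), the two debug files can be read
  like the other cgroup statistics. The sketch below is a minimal userspace
  example and assumes the io controller hierarchy is mounted at /cgroup with a
  group named test1; a dequeue count that grows quickly relative to the IO the
  group is doing suggests the group keeps getting deleted from the active tree.

	#include <stdio.h>

	/* Dump one statistics file; each line is "MAJOR:MINOR count" per disk */
	static void dump(const char *path)
	{
		char line[128];
		FILE *f = fopen(path, "r");

		if (!f) {
			perror(path);
			return;
		}
		while (fgets(line, sizeof(line), f))
			printf("%s: %s", path, line);
		fclose(f);
	}

	int main(void)
	{
		/* Example paths; they depend on the cgroup mount point */
		dump("/cgroup/test1/io.disk_queue");
		dump("/cgroup/test1/io.disk_dequeue");
		return 0;
	}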

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    8 +++
 block/elevator-fq.c   |  168 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h   |   29 +++++++++
 3 files changed, 202 insertions(+), 3 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..a7d0bf8 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -90,6 +90,14 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
 endmenu
 
 endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 563db2d..cb348fa 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -251,6 +251,91 @@ init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
 	entity->st = &parent_iog->sched_data.service_tree[idx];
 }
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, iog->path, sizeof(iog->path));
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	iog->path[0] = '\0';
+	return;
+}
+
+static inline void debug_update_stats_enqueue(struct io_entity *entity)
+{
+	struct io_group *iog = iog_of(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		/*
+		 * Keep track of how many times a group has been added
+		 * to active tree.
+		 */
+		iog->queue++;
+
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "add group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+
+static inline void debug_update_stats_dequeue(struct io_entity *entity)
+{
+	struct io_group *iog = iog_of(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		iog->dequeue++;
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "del group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+
+static inline void print_ioq_service_stats(struct io_queue *ioq)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+
+	elv_log_ioq(ioq->efqd, ioq, "service: QTt=%lu QTs=%lu GTt=%lu GTs=%lu",
+			ioq->entity.total_time, ioq->entity.total_sectors,
+			iog->entity.total_time, iog->entity.total_sectors);
+}
+
+#else /* DEBUG_GROUP_IOSCHED */
+static inline void io_group_path(struct io_group *iog) {}
+static inline void print_ioq_service_stats(struct io_queue *ioq) {}
+static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
+static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
+#endif /* DEBUG_GROUP_IOSCHED */
+
 static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
@@ -338,6 +423,7 @@ static void dequeue_io_entity(struct io_entity *entity)
 	entity->on_st = 0;
 	st->nr_active--;
 	sd->nr_active--;
+	debug_update_stats_dequeue(entity);
 }
 
 static void
@@ -386,6 +472,7 @@ static void enqueue_io_entity(struct io_entity *entity)
 	entity->on_st = 1;
 	place_entity(st, entity, 0);
 	__enqueue_io_entity(st, entity, 0);
+	debug_update_stats_enqueue(entity);
 }
 
 static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
@@ -544,6 +631,9 @@ EXPORT_SYMBOL(elv_put_ioq);
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+	elv_log_ioq(ioq->efqd, ioq, "ioq served: QSt=%lu QSs=%lu qued=%lu",
+			served, ioq->nr_sectors, ioq->nr_queued);
+	print_ioq_service_stats(ioq);
 }
 
 /*
@@ -797,7 +887,6 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
-
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -917,6 +1006,64 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	rcu_read_lock();
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key)
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->queue);
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key)
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+	}
+	spin_unlock_irq(&iocg->lock);
+	cgroup_unlock();
+
+	return 0;
+}
+#endif
+
 struct cftype io_files[] = {
 	{
 		.name = "weight",
@@ -936,6 +1083,16 @@ struct cftype io_files[] = {
 		.name = "disk_sectors",
 		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_queue",
+		.read_seq_string = io_cgroup_disk_queue_read,
+	},
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1063,6 +1220,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		 * or cgroup deletion path depending on who is exiting first.
 		 */
 		elv_get_iog(iog);
+		io_group_path(iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1237,6 +1395,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
 	iog->iocg_id = css_id(&iocg->css);
 	spin_unlock_irq(&iocg->lock);
+	io_group_path(iog);
 
 	return iog;
 }
@@ -1523,6 +1682,7 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 				efqd->busy_queues, ioq->entity.ioprio_class,
 				ioq->entity.ioprio, ioq->entity.weight,
 				iog_weight(iog), ioq->nr_queued);
+		print_ioq_service_stats(ioq);
 
 		ioq->slice_start = ioq->slice_end = 0;
 		ioq->dispatch_start = jiffies;
@@ -1581,10 +1741,11 @@ static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 {
 	BUG_ON(elv_ioq_busy(ioq));
 	BUG_ON(ioq == efqd->active_queue);
-	elv_log_ioq(efqd, ioq, "add to busy");
 	enqueue_ioq(ioq);
 	elv_mark_ioq_busy(ioq);
 	efqd->busy_queues++;
+	elv_log_ioq(efqd, ioq, "add to busy: qued=%d", ioq->nr_queued);
+	print_ioq_service_stats(ioq);
 }
 
 static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
@@ -1593,7 +1754,8 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
-	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_log_ioq(efqd, ioq, "del from busy: qued=%d", ioq->nr_queued);
+	print_ioq_service_stats(ioq);
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 9757e39..154014c 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -114,6 +114,16 @@ struct io_group {
 
 	/* The device MKDEV(major, minor), this group has been created for */
 	dev_t	dev;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	/* How many times this group has been added to active tree */
+	unsigned long queue;
+
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
+
+	/* Store cgroup path */
+	char path[128];
+#endif
 };
 
 struct io_cgroup {
@@ -170,10 +180,29 @@ struct elv_fq_data {
 };
 
 /* Logging facilities. */
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+{								\
+	blk_add_trace_msg((efqd)->queue, "elv%d%c %s " fmt, (ioq)->pid,	\
+			elv_ioq_sync(ioq) ? 'S' : 'A', \
+			ioq_to_io_group(ioq)->path, ##args); \
+}
+
+#define elv_log_iog(efqd, iog, fmt, args...) \
+{                                                                      \
+	blk_add_trace_msg((efqd)->queue, "elv %s " fmt, (iog)->path, ##args); \
+}
+
+#else
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
 				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
 
+#define elv_log_iog(efqd, iog, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#endif
+
 #define elv_log(efqd, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 11/23] io-controller: Introduce group idling
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal@redhat.com>
                     ` (9 preceding siblings ...)
  2009-08-28 21:30   ` [PATCH 10/23] io-controller: Debug hierarchical IO scheduling Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled Vivek Goyal
                     ` (19 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando,
	jmoyer, mingo, riel, fchecconi, containers, akpm, righi.andrea, torvalds

o IO from a process or group is not always continuous. There are cases of
  dependent reads where the next read is not issued until the previous read
  has finished. For such cases, CFQ introduced the notion of slice_idle: we
  idle on the queue for some time hoping the next request will come, and that
  is how fairness is provided. Otherwise the queue would be deleted from the
  service tree immediately and the process would not get its fair share.

o This patch introduces a similar concept at the group level. We idle on the
  group for a period of "group_idle", which is tunable through the sysfs
  interface (a small sketch of adjusting it follows this changelog). So if a
  group is empty and about to be deleted, we idle waiting for the next request.

o This patch also introduces the notion of wait busy, where we wait for one
  extra group_idle period even if the queue has consumed its time slice. The
  reason is that the group would lose its share upon removal from the service
  tree, as some other entity would be picked for dispatch and a vtime jump
  would take place.
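
o For illustration only (not part of this patch), the sketch below shows how
  the group_idle tunable could be adjusted from user space. The path follows
  the existing iosched tunables and assumes CFQ is the active scheduler on
  /dev/sdb; like slice_sync and slice_async, the value is read and written in
  milliseconds.

	#include <stdio.h>

	int main(void)
	{
		/* Example path; the device name is an assumption */
		FILE *f = fopen("/sys/block/sdb/queue/iosched/group_idle", "w");

		if (!f) {
			perror("group_idle");
			return 1;
		}
		/* Allow up to 8ms of idling on a group before it is expired */
		fprintf(f, "8\n");
		fclose(f);
		return 0;
	}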

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    5 +-
 block/elevator-fq.c |  207 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   44 +++++++++++-
 3 files changed, 247 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6c1f87a..11ae473 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -980,7 +980,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
-	    cfq_class_idle(cfqq))) {
+	    (cfq_class_idle(cfqq) && !elv_iog_should_idle(cfqq->ioq)))) {
 		cfq_slice_expired(cfqd);
 	}
 
@@ -2121,6 +2121,9 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_idle),
 	ELV_ATTR(slice_sync),
 	ELV_ATTR(slice_async),
+#ifdef CONFIG_GROUP_IOSCHED
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index cb348fa..6ea5be4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -19,6 +19,7 @@
 const int elv_slice_sync = HZ / 10;
 int elv_slice_async = HZ / 25;
 const int elv_slice_async_rq = 2;
+int elv_group_idle = HZ / 125;
 static struct kmem_cache *elv_ioq_pool;
 
 /*
@@ -251,6 +252,17 @@ init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
 	entity->st = &parent_iog->sched_data.service_tree[idx];
 }
 
+/*
+ * Returns the number of active entities a particular io group has. This
+ * includes number of active entities on service trees as well as the active
+ * entity which is being served currently, if any.
+ */
+
+static inline int elv_iog_nr_active(struct io_group *iog)
+{
+	return iog->sched_data.nr_active;
+}
+
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 static void io_group_path(struct io_group *iog)
 {
@@ -663,6 +675,8 @@ ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 		__data = jiffies_to_msecs(__data);			\
 	return elv_var_show(__data, (page));				\
 }
+SHOW_FUNCTION(elv_group_idle_show, efqd->elv_group_idle, 1);
+EXPORT_SYMBOL(elv_group_idle_show);
 SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
 EXPORT_SYMBOL(elv_slice_sync_show);
 SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
@@ -685,6 +699,8 @@ ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
 		*(__PTR) = __data;					\
 	return ret;							\
 }
+STORE_FUNCTION(elv_group_idle_store, &efqd->elv_group_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_group_idle_store);
 STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
@@ -846,6 +862,31 @@ static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
 	entity->my_sd = &iog->sched_data;
 }
 
+/* Check if we plan to idle on the group associated with this queue or not */
+int elv_iog_should_idle(struct io_queue *ioq)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * No idling on the group if group idle is disabled or if idling is
+	 * disabled for this group (currently the case for the root group).
+	 */
+	if (!efqd->elv_group_idle || !elv_iog_idle_window(iog))
+		return 0;
+
+	/*
+	 * If this is the last active queue in the group with no request
+	 * queued, we need to idle on the group before expiring the queue
+	 * to make sure the group does not lose its share.
+	 */
+	if ((elv_iog_nr_active(iog) <= 1) && !ioq->nr_queued)
+		return 1;
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_iog_should_idle);
+
 static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 {
 	struct io_entity *entity = &iog->entity;
@@ -1213,6 +1254,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 
 		atomic_set(&iog->ref, 0);
 
+		elv_mark_iog_idle_window(iog);
 		/*
 		 * Take the initial reference that will be released on destroy
 		 * This can be thought of a joint reference by cgroup and
@@ -1628,6 +1670,10 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
+/* No group idling in flat mode */
+int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
+EXPORT_SYMBOL(elv_iog_should_idle);
+
 #endif /* CONFIG_GROUP_IOSCHED */
 
 /*
@@ -1688,7 +1734,9 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 		ioq->dispatch_start = jiffies;
 
 		elv_clear_ioq_wait_request(ioq);
+		elv_clear_iog_wait_request(iog);
 		elv_clear_ioq_must_dispatch(ioq);
+		elv_clear_iog_wait_busy_done(iog);
 		elv_mark_ioq_slice_new(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
@@ -1787,14 +1835,19 @@ void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 {
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	long slice_used = 0, slice_overshoot = 0;
+	struct io_group *iog = ioq_to_io_group(ioq);
 
 	assert_spin_locked(q->queue_lock);
 	elv_log_ioq(efqd, ioq, "slice expired");
 
-	if (elv_ioq_wait_request(ioq))
+	if (elv_ioq_wait_request(ioq) || elv_iog_wait_request(iog)
+	    || elv_iog_wait_busy(iog))
 		del_timer(&efqd->idle_slice_timer);
 
 	elv_clear_ioq_wait_request(ioq);
+	elv_clear_iog_wait_request(iog);
+	elv_clear_iog_wait_busy(iog);
+	elv_clear_iog_wait_busy_done(iog);
 
 	/*
 	 * Queue got expired before even a single request completed or
@@ -1928,6 +1981,8 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 {
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog = ioq_to_io_group(ioq);
+	int group_wait = 0;
 
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
@@ -1940,6 +1995,24 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 	if (!elv_ioq_busy(ioq))
 		elv_add_ioq_busy(efqd, ioq);
 
+	if (elv_iog_wait_request(iog)) {
+		del_timer(&efqd->idle_slice_timer);
+		elv_clear_iog_wait_request(iog);
+		group_wait = 1;
+	}
+
+	/*
+	 * If we were waiting for a request on this group, the wait is
+	 * done. Schedule the next dispatch.
+	 */
+	if (elv_iog_wait_busy(iog)) {
+		del_timer(&efqd->idle_slice_timer);
+		elv_clear_iog_wait_busy(iog);
+		elv_mark_iog_wait_busy_done(iog);
+		elv_schedule_dispatch(q);
+		return;
+	}
+
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
 		 * Remember that we saw a request from this process, but
@@ -1951,7 +2024,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 		 * has other work pending, don't risk delaying until the
 		 * idle timer unplug to continue working.
 		 */
-		if (elv_ioq_wait_request(ioq)) {
+		if (group_wait || elv_ioq_wait_request(ioq)) {
 			del_timer(&efqd->idle_slice_timer);
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
@@ -1968,6 +2041,13 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 		 */
 		elv_preempt_queue(q, ioq);
 		__blk_run_queue(q);
+	} else if (group_wait) {
+		/*
+		 * Got a request in the group we were waiting for. The request
+		 * does not belong to the active queue and we have not decided
+		 * to preempt the current active queue. Schedule the dispatch.
+		 */
+		elv_schedule_dispatch(q);
 	}
 }
 
@@ -1985,6 +2065,14 @@ static void elv_idle_slice_timer(unsigned long data)
 	ioq = efqd->active_queue;
 
 	if (ioq) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+
+		elv_clear_iog_wait_request(iog);
+
+		if (elv_iog_wait_busy(iog)) {
+			elv_clear_iog_wait_busy(iog);
+			goto expire;
+		}
 
 		/*
 		 * We saw a request before the queue expired, let it through
@@ -2028,6 +2116,32 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q)
 		eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
 }
 
+static void elv_iog_arm_slice_timer(struct request_queue *q,
+				struct io_group *iog, int wait_for_busy)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+	unsigned long sl;
+
+	if (!efqd->elv_group_idle || !elv_iog_idle_window(iog))
+		return;
+	/*
+	 * The queue has consumed its time slice. We wait only for the group
+	 * to become busy again before we select the next queue for dispatch.
+	 */
+	if (wait_for_busy) {
+		elv_mark_iog_wait_busy(iog);
+		sl = efqd->elv_group_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_iog(efqd, iog, "arm idle group: %lu wait busy=1", sl);
+		return;
+	}
+
+	elv_mark_iog_wait_request(iog);
+	sl = efqd->elv_group_idle;
+	mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+	elv_log_iog(efqd, iog, "arm_idle group: %lu", sl);
+}
+
 /*
  * If io scheduler has functionality of keeping track of close cooperator, check
  * with it if it has got a closely co-operating queue.
@@ -2057,6 +2171,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
 {
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+	struct io_group *iog;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2064,6 +2179,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	if (ioq == NULL)
 		goto new_queue;
 
+	iog = ioq_to_io_group(ioq);
+
 	/*
 	 * Force dispatch. Continue to dispatch from current queue as long
 	 * as it has requests.
@@ -2075,11 +2192,47 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* We are waiting for this group to become busy before it expires. */
+	if (elv_iog_wait_busy(iog)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if ((elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+	     && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * The queue has used up its slice. Wait busy is not on,
+		 * otherwise we wouldn't have been here. If this group will be
+		 * deleted after the queue expiry, then make sure we have once
+		 * done wait busy on the group in an attempt to make it
+		 * backlogged.
+		 *
+		 * The following check helps in two situations.
+		 * - If there are requests dispatched from the queue and
+		 *   select_ioq() comes in before a request completion from
+		 *   the queue got a chance to arm any of the idle timers.
+		 *
+		 * - If at request completion time the slice had not expired
+		 *   and we armed either an ioq timer or a group timer, but by
+		 *   the time select_ioq() hits, the slice has expired and it
+		 *   will expire the queue without doing busy wait on the group.
+		 *
+		 * In similar situations CFQ deletes the queue even if an idle
+		 * timer is armed. That does not impact fairness in a
+		 * non-hierarchical setup due to weighted slice lengths. But in
+		 * a hierarchical setup, where group slice lengths are derived
+		 * from the queue and are not proportional to the group's
+		 * weight, it harms the fairness of the group.
+		 */
+		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * The active queue has requests and isn't expired, allow it to
@@ -2111,6 +2264,12 @@ void *elv_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	/* Check for group idling */
+	if (elv_iog_should_idle(ioq) && elv_ioq_nr_dispatched(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 expire:
 	elv_slice_expired(q);
 new_queue:
@@ -2182,11 +2341,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	const int sync = rq_is_sync(rq);
 	struct io_queue *ioq;
 	struct elv_fq_data *efqd = q->elevator->efqd;
+	struct io_group *iog;
 
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
 
 	ioq = rq->ioq;
+	iog = ioq_to_io_group(ioq);
 	WARN_ON(!efqd->rq_in_driver);
 	WARN_ON(!ioq->dispatched);
 	efqd->rq_in_driver--;
@@ -2212,13 +2373,44 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * mean seek distance, give them a chance to run instead
 		 * of idling.
 		 */
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq)) {
+			/*
+			 * This is the last, now empty, queue in the group and
+			 * it has consumed its slice. If we expire it right
+			 * away the group might lose its share. Wait for an
+			 * extra group_idle period for a request before the
+			 * queue expires.
+			 */
+			if (elv_iog_should_idle(ioq)) {
+				elv_iog_arm_slice_timer(q, iog, 1);
+				goto done;
+			}
+
+			/* Expire the queue */
 			elv_slice_expired(q);
-		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+			goto done;
+		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
 			 && sync && !rq_noidle(rq))
 			elv_ioq_arm_slice_timer(q);
+		/*
+		 * If this is the last queue in the group and we did not
+		 * decide to idle on the queue, idle on the group instead.
+		 */
+		if (elv_iog_should_idle(ioq) && !ioq->dispatched
+		    && !timer_pending(&efqd->idle_slice_timer)) {
+			/*
+			 * If the queue has used up its slice, wait for one
+			 * extra group_idle period to let the group become
+			 * backlogged again. This is to avoid the group losing
+			 * its fair share.
+			 */
+			if (elv_ioq_slice_used(ioq))
+				elv_iog_arm_slice_timer(q, iog, 1);
+			else
+				elv_iog_arm_slice_timer(q, iog, 0);
+		}
 	}
-
+done:
 	if (!efqd->rq_in_driver)
 		elv_schedule_dispatch(q);
 }
@@ -2295,6 +2487,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_group_idle = elv_group_idle;
 
 	return 0;
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 154014c..0a34c7f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -98,6 +98,7 @@ struct io_queue {
 struct io_group {
 	struct io_entity entity;
 	atomic_t ref;
+	unsigned int flags;
 	struct io_sched_data sched_data;
 	struct hlist_node group_node;
 	struct hlist_node elv_data_node;
@@ -172,6 +173,8 @@ struct elv_fq_data {
 	struct timer_list idle_slice_timer;
 	struct work_struct unplug_work;
 
+	unsigned int elv_group_idle;
+
 	/* Base slice length for sync and async queues */
 	unsigned int elv_slice[2];
 
@@ -240,6 +243,42 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+enum elv_group_state_flags {
+	ELV_GROUP_FLAG_idle_window,	  /* elevator group idling enabled */
+	ELV_GROUP_FLAG_wait_request,	  /* waiting for a request */
+	ELV_GROUP_FLAG_wait_busy,	  /* wait for this queue to get busy */
+	ELV_GROUP_FLAG_wait_busy_done,	  /* have already waited on this group */
+};
+
+#define ELV_IO_GROUP_FLAG_FNS(name)					\
+static inline void elv_mark_iog_##name(struct io_group *iog)		\
+{                                                                       \
+	(iog)->flags |= (1 << ELV_GROUP_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_iog_##name(struct io_group *iog)		\
+{                                                                       \
+	(iog)->flags &= ~(1 << ELV_GROUP_FLAG_##name);			\
+}                                                                       \
+static inline int elv_iog_##name(struct io_group *iog)         		\
+{                                                                       \
+	return ((iog)->flags & (1 << ELV_GROUP_FLAG_##name)) != 0;	\
+}
+
+#else /* GROUP_IOSCHED */
+
+#define ELV_IO_GROUP_FLAG_FNS(name)					\
+static inline void elv_mark_iog_##name(struct io_group *iog) {}		\
+static inline void elv_clear_iog_##name(struct io_group *iog) {}	\
+static inline int elv_iog_##name(struct io_group *iog) { return 0; }
+#endif /* GROUP_IOSCHED */
+
+ELV_IO_GROUP_FLAG_FNS(idle_window)
+ELV_IO_GROUP_FLAG_FNS(wait_request)
+ELV_IO_GROUP_FLAG_FNS(wait_busy)
+ELV_IO_GROUP_FLAG_FNS(wait_busy_done)
+
 static inline void elv_get_ioq(struct io_queue *ioq)
 {
 	atomic_inc(&ioq->ref);
@@ -365,7 +404,9 @@ extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void elv_put_iog(struct io_group *iog);
 extern struct io_group *elv_io_get_io_group(struct request_queue *q,
 						int create);
-
+extern ssize_t elv_group_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_group_idle_store(struct elevator_queue *q, const char *name,
+					size_t count);
 static inline void elv_get_iog(struct io_group *iog)
 {
 	atomic_inc(&iog->ref);
@@ -433,6 +474,7 @@ extern void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
 extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
+extern int elv_iog_should_idle(struct io_queue *ioq);
 
 #else /* CONFIG_ELV_FAIR_QUEUING */
 static inline struct elv_fq_data *
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o Currently one can dispatch requests from multiple queues to the disk. This
  is true for hardware which supports queuing. So if a disk supports a queue
  depth of 31, it is possible that 20 requests are dispatched from queue 1
  and then the next queue is scheduled in, which dispatches more requests.

o This multiple queue dispatch introduces issues for accurate accounting of
  disk time consumed by a particular queue. For example, if one async queue
  is scheduled in, it can dispatch 31 requests to the disk and then it will
  be expired and a new sync queue might get scheduled in. These 31 requests
  might take a long time to finish but this time is never accounted to the
  async queue which dispatched these requests.

o This patch introduces the functionality where we wait for all the requests
  to finish from the previous queue before the next queue is scheduled in. That
  way a queue is more accurately accounted for the disk time it has consumed.
  Note that this still does not take care of errors introduced by disk write
  caching.

o Because the above behavior can result in reduced throughput, it is enabled
  only if the user sets the "fairness" tunable to 1.

o This patch helps in achieving more isolation between reads and buffered
  writes in different cgroups. Buffered writes typically utilize the full queue
  depth and then expire the queue. On the contrary, sequential reads typically
  drive a queue depth of 1. So despite the fact that writes are using more disk
  time, that time is never accounted to the write queue because we don't wait
  for requests to finish after dispatching them. This patch helps do more
  accurate accounting of disk time, especially for buffered writes, hence
  providing better fairness and therefore better isolation between two cgroups
  running read and write workloads (see the sketch below).
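
For readers skimming the diff, here is a minimal, compile-able userspace sketch
of the behavior being added. The names below (struct fq_data, struct ioq,
may_expire_active(), on_request_completed()) are simplified stand-ins invented
for illustration, not the real elevator-fq.c types and functions
(struct elv_fq_data, struct io_queue, elv_select_ioq(),
elv_ioq_completed_request()); it only models how the "fairness" flag keeps the
active queue from being expired while it still has dispatched-but-uncompleted
requests.

/*
 * Illustrative sketch only -- simplified stand-ins for the elevator-fq.c
 * structures; the real checks live in elv_select_ioq() and
 * elv_ioq_completed_request().
 */
#include <stdio.h>

struct ioq {
	unsigned long dispatched;	/* sent to disk, not yet completed */
};

struct fq_data {
	unsigned int fairness;		/* the new tunable, 0 or 1 */
	struct ioq *active;		/* currently scheduled queue */
};

/* May the active queue be expired so another queue can be scheduled in? */
static int may_expire_active(struct fq_data *fqd)
{
	struct ioq *ioq = fqd->active;

	if (!ioq)
		return 1;
	/*
	 * With fairness set, keep the queue active until every request it
	 * dispatched has completed, so the disk time is charged to it.
	 */
	if (fqd->fairness && ioq->dispatched)
		return 0;
	return 1;
}

/* Completion path: the last completion is what finally allows the switch. */
static void on_request_completed(struct fq_data *fqd)
{
	if (fqd->active && fqd->active->dispatched)
		fqd->active->dispatched--;
}

int main(void)
{
	struct ioq async_q = { .dispatched = 31 };
	struct fq_data fqd = { .fairness = 1, .active = &async_q };

	printf("may expire with 31 in flight? %d\n", may_expire_active(&fqd));
	while (async_q.dispatched)
		on_request_completed(&fqd);
	printf("may expire after all completions? %d\n", may_expire_active(&fqd));
	return 0;
}

As the patch notes, this waiting costs throughput, so the behavior stays off
unless the user sets the new "fairness" elevator attribute (added to cfq_attrs
below) to 1.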

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    1 +
 block/elevator-fq.c |   21 ++++++++++++++++++++-
 block/elevator-fq.h |   10 +++++++++-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 11ae473..52c4710 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2123,6 +2123,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	ELV_ATTR(slice_async),
 #ifdef CONFIG_GROUP_IOSCHED
 	ELV_ATTR(group_idle),
+	ELV_ATTR(fairness),
 #endif
 	__ATTR_NULL
 };
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 6ea5be4..840b73b 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -681,6 +681,8 @@ SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
 EXPORT_SYMBOL(elv_slice_sync_show);
 SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
 EXPORT_SYMBOL(elv_slice_async_show);
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -705,6 +707,8 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
 #undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
@@ -2271,6 +2275,17 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	}
 
 expire:
+	if (efqd->fairness && !force && ioq && ioq->dispatched) {
+		/*
+		 * If there are requests dispatched from this queue, don't
+		 * dispatch requests from new queue till all the requests from
+		 * this queue have completed.
+		 */
+		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
+				" disp=%lu", ioq->dispatched);
+		ioq = NULL;
+		goto keep_queue;
+	}
 	elv_slice_expired(q);
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
@@ -2386,6 +2401,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 				goto done;
 			}
 
+			/* Wait for requests to finish from this queue */
+			if (efqd->fairness && elv_ioq_nr_dispatched(ioq))
+				goto done;
+
 			/* Expire the queue */
 			elv_slice_expired(q);
 			goto done;
@@ -2396,7 +2415,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * If this is the last queue in the group and we did not
 		 * decide to idle on queue, idle on group.
 		 */
-		if (elv_iog_should_idle(ioq) && !ioq->dispatched
+		if (elv_iog_should_idle(ioq) && !elv_ioq_nr_dispatched(ioq)
 		    && !timer_pending(&efqd->idle_slice_timer)) {
 			/*
 			 * If queue has used up its slice, wait for the
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 0a34c7f..b9f3fc7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -180,6 +180,12 @@ struct elv_fq_data {
 
 	/* Fallback dummy ioq for extreme OOM conditions */
 	struct io_queue oom_ioq;
+
+	/*
+	 * If set to 1, waits for all request completions from current
+	 * queue before new queue is scheduled in
+	 */
+	unsigned int fairness;
 };
 
 /* Logging facilities. */
@@ -437,7 +443,9 @@ extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
 						size_t count);
-
+extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
+					size_t count);
 /* Functions used by elevator.c */
 extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
 					struct elevator_queue *e);
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 13/23] io-controller: Separate out queue and data
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 14/23] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
                     ` (17 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o So far noop, deadline and AS had one common structure called *_data which
  contained both the queue information where requests are queued and also
  common data used for scheduling. This patch breaks down this common
  structure into two parts, *_queue and *_data. This is along the lines of
  cfq, where all the requests are queued in the queue and common data and
  tunables are part of the data (a toy illustration of this split is sketched
  after this list).

o It does not change the functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; we can check
  q->nr_sorted in the elevator layer to see if the io scheduler queues are
  empty or not.
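
As a rough illustration of the split (not kernel code; the names toy_data,
toy_queue, toy_alloc_queue() and toy_free_queue() are made up for this sketch),
per-queue request storage moves into a *_queue object that gets allocated and
freed through the new elevator_alloc_sched_queue_fn /
elevator_free_sched_queue_fn hooks, while device-wide tunables remain in
*_data:

/* Toy illustration of the *_data / *_queue split; not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct toy_request {
	long sector;
	struct toy_request *next;
};

/* Per scheduler queue: where requests actually sit (moves out of *_data). */
struct toy_queue {
	struct toy_request *head;
};

/* Per device: only shared tunables and bookkeeping remain here. */
struct toy_data {
	int fifo_batch;			/* example tunable shared by all queues */
};

/* Rough analogue of an elevator_alloc_sched_queue_fn hook. */
static struct toy_queue *toy_alloc_queue(void)
{
	return calloc(1, sizeof(struct toy_queue));
}

/* Rough analogue of an elevator_free_sched_queue_fn hook. */
static void toy_free_queue(struct toy_queue *tq)
{
	free(tq);
}

static void toy_add_request(struct toy_queue *tq, struct toy_request *rq)
{
	rq->next = tq->head;
	tq->head = rq;
}

int main(void)
{
	struct toy_data td = { .fifo_batch = 16 };
	struct toy_queue *tq = toy_alloc_queue();
	struct toy_request rq = { .sector = 2048, .next = NULL };

	if (!tq)
		return 1;
	toy_add_request(tq, &rq);
	printf("fifo_batch=%d, first sector=%ld\n", td.fifo_batch,
	       tq->head->sector);
	toy_free_queue(tq);
	return 0;
}

The point of the reorganization, presumably, is that later patches can then
create one *_queue per group while keeping a single *_data per device, which
is what hierarchical fair queuing for noop, deadline and AS needs.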

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    9 ++-
 5 files changed, 320 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b90acbe..ec6b940 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 25af8b9..5b017da 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -356,10 +384,7 @@ deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -446,13 +471,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index b2725cd..0b7c5a6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -197,17 +197,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *
+elevator_init_data(struct request_queue *q, struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q, eq);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q, eq);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void
+elevator_free_sched_queue(struct elevator_queue *eq, void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *
+elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+				void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -288,7 +325,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -322,13 +359,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -336,6 +381,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1024,7 +1070,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1033,10 +1079,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1053,7 +1107,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1168,16 +1222,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return elv_ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using the fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return elv_ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return elv_ioq_sched_queue(elv_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 36fc210..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4414a61..2c6b0c7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,10 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +70,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -109,6 +112,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -255,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 13/23] io-controller: Separate out queue and data
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o So far noop, deadline and AS had one common structure called *_data which
  contained both the queue information where requests are queued and also
  common data used for scheduling. This patch breaks down this common
  structure into two parts, *_queue and *_data. This is along the lines of
  cfq, where all the requests are queued in the queue and common data and
  tunables are part of the data.

o It does not change the functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; we can check
  q->nr_sorted in the elevator layer to see if the io scheduler queues are
  empty or not.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    9 ++-
 5 files changed, 320 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b90acbe..ec6b940 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 25af8b9..5b017da 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -356,10 +384,7 @@ deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -446,13 +471,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index b2725cd..0b7c5a6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -197,17 +197,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *
+elevator_init_data(struct request_queue *q, struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q, eq);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q, eq);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void
+elevator_free_sched_queue(struct elevator_queue *eq, void *sched_queue)
+{
+	/* Not all io schedulers (e.g. cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *
+elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+				void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -288,7 +325,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -322,13 +359,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -336,6 +381,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1024,7 +1070,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1033,10 +1079,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1053,7 +1107,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1168,16 +1222,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return elv_ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using the fair queuing infrastructure. If the io
+	 * scheduler has passed a non-null rq, retrieve the sched_queue
+	 * pointer from there. */
+	if (rq)
+		return elv_ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return elv_ioq_sched_queue(elv_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 36fc210..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4414a61..2c6b0c7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,10 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +70,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -109,6 +112,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -255,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 13/23] io-controller: Separate out queue and data
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o So far noop, deadline and AS each had a single common structure, *_data,
  which contained both the queue where requests are kept and the common data
  used for scheduling. This patch splits that structure into two parts,
  *_queue and *_data, along the lines of cfq, where requests are queued in
  the queue structure while common data and tunables live in the data
  structure.

o It does not change any functionality, but the re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o The queue_empty function also looks unnecessary: the elevator layer can
  check q->nr_sorted to see whether the io scheduler queues are empty.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    9 ++-
 5 files changed, 320 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b90acbe..ec6b940 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 25af8b9..5b017da 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request and are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -356,10 +384,7 @@ deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -446,13 +471,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index b2725cd..0b7c5a6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -197,17 +197,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *
+elevator_init_data(struct request_queue *q, struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q, eq);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q, eq);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void
+elevator_free_sched_queue(struct elevator_queue *eq, void *sched_queue)
+{
+	/* Not all io schedulers store a sched_queue (cfq does not) */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *
+elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+				void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -288,7 +325,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -322,13 +359,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -336,6 +381,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1024,7 +1070,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1033,10 +1079,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1053,7 +1107,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1168,16 +1222,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return elv_ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using the fair queuing infrastructure. If the io
+	 * scheduler has passed a non-NULL rq, retrieve the sched_queue
+	 * pointer from there. */
+	if (rq)
+		return elv_ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return elv_ioq_sched_queue(elv_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 36fc210..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4414a61..2c6b0c7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,10 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +70,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -109,6 +112,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -255,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6
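
To make the new interface concrete: after this patch a converted single-queue
scheduler only deals with its own per-queue state and asks the elevator layer
which queue to use, roughly as in the sketch below (an illustration modelled
on the noop conversion above; example_queue and the example_* functions are
placeholder names, not symbols from the patch):

	struct example_queue {
		struct list_head queue;
	};

	static void example_add_request(struct request_queue *q, struct request *rq)
	{
		struct example_queue *xq = elv_get_sched_queue(q, rq);

		list_add_tail(&rq->queuelist, &xq->queue);
	}

	static int example_dispatch(struct request_queue *q, int force)
	{
		/* may be NULL when fair queuing selects no eligible queue */
		struct example_queue *xq = elv_select_sched_queue(q, force);
		struct request *rq;

		if (!xq || list_empty(&xq->queue))
			return 0;

		rq = list_entry(xq->queue.next, struct request, queuelist);
		list_del_init(&rq->queuelist);
		elv_dispatch_sort(q, rq);
		return 1;
	}

Whether elv_get_sched_queue()/elv_select_sched_queue() hand back the single
queue stored in elevator_queue->sched_queue or a per-group queue from the
fair queuing layer is hidden from the io scheduler.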

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 14/23] io-controller: Prepare elevator layer for single queue schedulers
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 13/23] io-controller: Separate out queue and data Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
                     ` (16 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

The elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it, and now it is time to do the groundwork for
noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, one can think of the hierarchical
setup as having one queue per cgroup, where requests from all the processes
in the cgroup are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue here, we have modified the common layer to take care of queue
creation and some other functionality. This special casing helps keep the
changes to noop, deadline and AS to a minimum.
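
As an illustration of the interface (not part of this patch; "example" is a
placeholder name), a single-queue io scheduler opts into this mode by setting
both feature flags introduced below in its elevator_type:

	static struct elevator_type elevator_example = {
		.ops = {
			/* the usual add_req/dispatch/merge hooks */
		},
		/* use the common fair queuing layer, one ioq per cgroup */
		.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
		.elevator_name	= "example",
		.elevator_owner	= THIS_MODULE,
	};

The common layer then bypasses the scheduler's own set_request/put_request
hooks and manages the per-group io queue itself, as the elevator.c changes
below show.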

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  211 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   36 ++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   18 ++++-
 7 files changed, 300 insertions(+), 8 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index ec6b940..6d2468b 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5b017da..6e69ea3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 840b73b..0289fff 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -767,7 +767,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
 	RB_CLEAR_NODE(&ioq->entity.rb_node);
 	atomic_set(&ioq->ref, 0);
 	ioq->efqd = eq->efqd;
-	ioq->pid = pid;
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
 
 	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
 	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
@@ -801,6 +804,12 @@ put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
@@ -1641,6 +1650,172 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 	return (iog == __iog);
 }
 
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void
+elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
+/*
+ * Find/create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a
+ * single io queue per cgroup. In this case the common layer can simply
+ * keep a pointer to it in the group data structure.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group the request belongs to */
+	iog = elv_io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue need to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call the io scheduler to create the scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, current->pid, 1);
+		elv_init_ioq_io_group(ioq, iog);
+		elv_init_ioq_sched_queue(e, ioq, sched_q);
+
+		elv_io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which an io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
 static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
@@ -1678,6 +1853,11 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
+
 #endif /* CONFIG_GROUP_IOSCHED */
 
 /*
@@ -1917,6 +2097,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption within
+	 * a class, etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2196,6 +2384,14 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If only the root group is present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS).
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+		goto keep_queue;
+
 	/* We are waiting for this group to become busy before it expires.*/
 	if (elv_iog_wait_busy(iog)) {
 		ioq = NULL;
@@ -2382,6 +2578,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_clear_ioq_slice_new(ioq);
 		}
 		/*
+		 * If only the root group is present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b9f3fc7..a63308b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -125,6 +125,9 @@ struct io_group {
 	/* Store cgroup path */
 	char path[128];
 #endif
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 struct io_cgroup {
@@ -418,6 +421,11 @@ static inline void elv_get_iog(struct io_group *iog)
 	atomic_inc(&iog->ref);
 }
 
+extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
 #else /* !GROUP_IOSCHED */
 
 static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
@@ -435,6 +443,20 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -529,6 +551,20 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 0b7c5a6..bc43edd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -228,9 +228,17 @@ elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * in set_request(), when a request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -861,6 +869,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -872,6 +887,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only a single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_reset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1256,9 +1280,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from it. This is used only by single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c6b0c7..77c1fa5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,9 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
-					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+		struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
@@ -245,17 +245,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6
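
To summarize the request-setup path this patch adds for the single-ioq
schedulers, the condensed logic of elv_set_request_ioq() looks roughly like
the sketch below (set_request_ioq_sketch is just an illustrative name; the
locking, the allocation retry loop and the failure paths of the real code
are omitted):

	static int set_request_ioq_sketch(struct request_queue *q,
					  struct request *rq, gfp_t gfp_mask)
	{
		struct elevator_queue *e = q->elevator;
		struct io_group *iog = elv_io_get_io_group(q, 1);
		struct io_queue *ioq = iog->ioq;
		void *sched_q;

		if (!ioq) {
			/* first request from this cgroup: create its queues */
			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
						gfp_mask | __GFP_ZERO, ioq);

			elv_init_ioq(e, ioq, current->pid, 1);
			elv_init_ioq_io_group(ioq, iog);
			elv_init_ioq_sched_queue(e, ioq, sched_q);

			/* group holds one reference, queue marked sync */
			elv_io_group_set_ioq(iog, ioq);
			elv_mark_ioq_sync(ioq);
			elv_get_iog(iog);
		}

		elv_get_ioq(ioq);	/* reference held by the request */
		rq->ioq = ioq;
		return 0;
	}

The reference taken here is dropped again from elv_put_request() via
elv_reset_request_ioq(), and the group's own reference goes away in
put_io_group_queues() when the cgroup is destroyed.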

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 14/23] io-conroller: Prepare elevator layer for single queue schedulers
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do groundwork for
noop, deadline and AS.

noop deadline and AS don't maintain separate queues for different processes.
There is only one single queue. Effectively one can think that in hierarchical
setup, there will be one queue per cgroup where requests from all the
processes in the cgroup will be queued.

Generally io scheduler takes care of creating queues. Because there is
only one queue here, we have modified common layer to take care of queue
creation and some other functionality. This special casing helps in keeping
the changes to noop, deadline and AS to the minimum.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  211 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   36 ++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   18 ++++-
 7 files changed, 300 insertions(+), 8 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index ec6b940..6d2468b 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5b017da..6e69ea3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 840b73b..0289fff 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -767,7 +767,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
 	RB_CLEAR_NODE(&ioq->entity.rb_node);
 	atomic_set(&ioq->ref, 0);
 	ioq->efqd = eq->efqd;
-	ioq->pid = pid;
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
 
 	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
 	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
@@ -801,6 +804,12 @@ put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
@@ -1641,6 +1650,172 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 	return (iog == __iog);
 }
 
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void
+elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only single
+ * io queue per cgroup. In this case common layer can just maintain a
+ * pointer in group data structure and keeps track of it.
+ *
+ * For the io schdulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue  of request based on process, this
+ * function is not invoked.
+ */
+int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group request belongs to */
+	iog = elv_io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue needs to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduer to create scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, current->pid, 1);
+		elv_init_ioq_io_group(ioq, iog);
+		elv_init_ioq_sched_queue(e, ioq, sched_q);
+
+		elv_io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* May be task belongs to a cgroup for which io group has
+		 * not been setup yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
 static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
@@ -1678,6 +1853,11 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
+
 #endif /* CONFIG_GROUP_IOSCHED */
 
 /*
@@ -1917,6 +2097,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption with-in
+	 * class etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2196,6 +2384,14 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If there is only root group present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS).
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+		goto keep_queue;
+
 	/* We are waiting for this group to become busy before it expires.*/
 	if (elv_iog_wait_busy(iog)) {
 		ioq = NULL;
@@ -2382,6 +2578,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_clear_ioq_slice_new(ioq);
 		}
 		/*
+		 * If there is only root group present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b9f3fc7..a63308b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -125,6 +125,9 @@ struct io_group {
 	/* Store cgroup path */
 	char path[128];
 #endif
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 struct io_cgroup {
@@ -418,6 +421,11 @@ static inline void elv_get_iog(struct io_group *iog)
 	atomic_inc(&iog->ref);
 }
 
+extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
 #else /* !GROUP_IOSCHED */
 
 static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
@@ -435,6 +443,20 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -529,6 +551,20 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 0b7c5a6..bc43edd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -228,9 +228,17 @@ elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * during set_request() functions when request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -861,6 +869,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -872,6 +887,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_reset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1256,9 +1280,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of task and retrieve
+ * the ioq pointer from that. This is used by only single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c6b0c7..77c1fa5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,9 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
-					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+		struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
@@ -245,17 +245,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 14/23] io-conroller: Prepare elevator layer for single queue schedulers
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do groundwork for
noop, deadline and AS.

noop deadline and AS don't maintain separate queues for different processes.
There is only one single queue. Effectively one can think that in hierarchical
setup, there will be one queue per cgroup where requests from all the
processes in the cgroup will be queued.

Generally io scheduler takes care of creating queues. Because there is
only one queue here, we have modified common layer to take care of queue
creation and some other functionality. This special casing helps in keeping
the changes to noop, deadline and AS to the minimum.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  211 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   36 ++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   18 ++++-
 7 files changed, 300 insertions(+), 8 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index ec6b940..6d2468b 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5b017da..6e69ea3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 840b73b..0289fff 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -767,7 +767,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
 	RB_CLEAR_NODE(&ioq->entity.rb_node);
 	atomic_set(&ioq->ref, 0);
 	ioq->efqd = eq->efqd;
-	ioq->pid = pid;
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
 
 	elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
 	elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
@@ -801,6 +804,12 @@ put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
@@ -1641,6 +1650,172 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 	return (iog == __iog);
 }
 
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void
+elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only single
+ * io queue per cgroup. In this case common layer can just maintain a
+ * pointer in group data structure and keeps track of it.
+ *
+ * For the io schdulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue  of request based on process, this
+ * function is not invoked.
+ */
+int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group request belongs to */
+	iog = elv_io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue needs to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call the io scheduler to create the scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, current->pid, 1);
+		elv_init_ioq_io_group(ioq, iog);
+		elv_init_ioq_sched_queue(e, ioq, sched_q);
+
+		elv_io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of the current task. Optimization for io schedulers
+ * that keep a single ioq per io group.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which the io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
 static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
@@ -1678,6 +1853,11 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
+
 #endif /* CONFIG_GROUP_IOSCHED */
 
 /*
@@ -1917,6 +2097,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption within
+	 * a class, etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2196,6 +2384,14 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If there is only root group present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS).
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+		goto keep_queue;
+
 	/* We are waiting for this group to become busy before it expires.*/
 	if (elv_iog_wait_busy(iog)) {
 		ioq = NULL;
@@ -2382,6 +2578,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_clear_ioq_slice_new(ioq);
 		}
 		/*
+		 * If there is only root group present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b9f3fc7..a63308b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -125,6 +125,9 @@ struct io_group {
 	/* Store cgroup path */
 	char path[128];
 #endif
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 struct io_cgroup {
@@ -418,6 +421,11 @@ static inline void elv_get_iog(struct io_group *iog)
 	atomic_inc(&iog->ref);
 }
 
+extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
 #else /* !GROUP_IOSCHED */
 
 static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
@@ -435,6 +443,20 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -529,6 +551,20 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 0b7c5a6..bc43edd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -228,9 +228,17 @@ elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, queue allocation takes place
+	 * during set_request(), when the request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -861,6 +869,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS, which maintain only a
+	 * single ioq per io group.
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -872,6 +887,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS, which maintain only a
+	 * single ioq per io group.
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_reset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1256,9 +1280,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single-queue ioschedulers
+ * to retrieve the queue associated with the group and decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c6b0c7..77c1fa5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,9 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 typedef void *(elevator_init_fn) (struct request_queue *,
 					struct elevator_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
-					struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+		struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
@@ -245,17 +245,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only a single ioq per group. */
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6
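
A note for readers skimming the elv_set_request_ioq() hunk above: the
unlock/allocate/relock dance there is the usual idiom for a blocking
allocation under queue_lock. The following is a stripped-down sketch of just
that pattern, illustrative only; lookup_group_ioq() and install_group_ioq()
are hypothetical stand-ins for the io-group lookup and assignment the real
code does.

static struct io_queue *
get_group_ioq(struct request_queue *q, gfp_t gfp_mask)
{
	struct io_queue *ioq, *new_ioq = NULL;

	spin_lock_irq(q->queue_lock);
retry:
	ioq = lookup_group_ioq(q);		/* hypothetical */
	if (!ioq && !new_ioq && (gfp_mask & __GFP_WAIT)) {
		/*
		 * The allocator may sleep, so drop the queue lock, allocate,
		 * re-take the lock and re-check: someone else may have set
		 * up the queue while we slept.  __GFP_NOFAIL means the
		 * allocation cannot return NULL here.
		 */
		spin_unlock_irq(q->queue_lock);
		new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
						| __GFP_ZERO);
		spin_lock_irq(q->queue_lock);
		goto retry;
	}
	if (!ioq && new_ioq) {
		install_group_ioq(q, new_ioq);	/* hypothetical */
		ioq = new_ioq;
		new_ioq = NULL;
	}
	spin_unlock_irq(q->queue_lock);

	if (new_ioq)	/* lost the race; drop our spare queue */
		elv_free_ioq(new_ioq);
	return ioq;
}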

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 14/23] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 16/23] io-controller: deadline " Vivek Goyal
                     ` (15 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
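
For orientation, here is a hedged sketch of what dispatch looks like once noop
runs in hierarchical mode (condensed, not taken verbatim from the patch;
hier_noop_dispatch is an illustrative name): the elevator fair queuing core
decides which cgroup's queue gets disk time, and noop only serves that queue
in FIFO order.

static int hier_noop_dispatch(struct request_queue *q, int force)
{
	/* fairness decision: which group's noop_queue should run now */
	struct noop_queue *nq = elv_select_sched_queue(q, force);
	struct request *rq;

	if (!nq || list_empty(&nq->queue))
		return 0;

	/* within the chosen group, plain old noop FIFO behaviour */
	rq = list_entry(nq->queue.next, struct request, queuelist);
	list_del_init(&rq->queuelist);
	elv_dispatch_sort(q, rq);
	return 1;
}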

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |   14 ++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a7d0bf8..28cd500 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 731dbf2..4ba496f 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -6,6 +6,7 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include "elevator-fq.h"
 
 struct noop_queue {
 	struct list_head queue;
@@ -82,6 +83,15 @@ static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 	kfree(nq);
 }
 
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+	__ATTR_NULL
+};
+#endif
+
 static struct elevator_type elevator_noop = {
 	.ops = {
 		.elevator_merge_req_fn		= noop_merged_requests,
@@ -92,6 +102,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+	.elevator_attrs = noop_attrs,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |   14 ++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a7d0bf8..28cd500 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 731dbf2..4ba496f 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -6,6 +6,7 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include "elevator-fq.h"
 
 struct noop_queue {
 	struct list_head queue;
@@ -82,6 +83,15 @@ static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 	kfree(nq);
 }
 
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+	__ATTR_NULL
+};
+#endif
+
 static struct elevator_type elevator_noop = {
 	.ops = {
 		.elevator_merge_req_fn		= noop_merged_requests,
@@ -92,6 +102,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+	.elevator_attrs = noop_attrs,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |   14 ++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a7d0bf8..28cd500 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 731dbf2..4ba496f 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -6,6 +6,7 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include "elevator-fq.h"
 
 struct noop_queue {
 	struct list_head queue;
@@ -82,6 +83,15 @@ static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 	kfree(nq);
 }
 
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+	__ATTR_NULL
+};
+#endif
+
 static struct elevator_type elevator_noop = {
 	.ops = {
 		.elevator_merge_req_fn		= noop_merged_requests,
@@ -92,6 +102,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+	.elevator_attrs = noop_attrs,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 16/23] io-controller: deadline changes for hierarchical fair queuing
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 17/23] io-controller: anticipatory " Vivek Goyal
                     ` (14 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to the old deadline behaviour by deselecting CONFIG_IOSCHED_DEADLINE_HIER.
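
As a quick recap of where the new feature flag bites (this is the hook added
in patch 14, condensed here for convenience, not new code): once deadline
advertises ELV_IOSCHED_SINGLE_IOQ, the elevator core routes request setup to
its own per-group ioq management instead of calling into the iosched.

int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
{
	struct elevator_queue *e = q->elevator;

	/* noop, deadline and AS: common layer manages the per-group ioq */
	if (elv_iosched_single_ioq(e))
		return elv_set_request_ioq(q, rq, gfp_mask);

	/* cfq keeps its own per-process queue setup path */
	if (e->ops->elevator_set_req_fn)
		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);

	return 0;
}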

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    9 +++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 28cd500..cc87c87 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 6e69ea3..e5bc823 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
+#include "elevator-fq.h"
 
 /*
  * See Documentation/block/deadline-iosched.txt
@@ -461,6 +462,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
@@ -478,6 +484,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 16/23] io-controller: deadline changes for hierarchical fair queuing
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to the old deadline behaviour by deselecting CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    9 +++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 28cd500..cc87c87 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 6e69ea3..e5bc823 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
+#include "elevator-fq.h"
 
 /*
  * See Documentation/block/deadline-iosched.txt
@@ -461,6 +462,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
@@ -478,6 +484,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 16/23] io-controller: deadline changes for hierarchical fair queuing
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to the old deadline behaviour by deselecting CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    9 +++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 28cd500..cc87c87 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 6e69ea3..e5bc823 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
+#include "elevator-fq.h"
 
 /*
  * See Documentation/block/deadline-iosched.txt
@@ -461,6 +462,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
@@ -478,6 +484,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 17/23] io-controller: anticipatory changes for hierarchical fair queuing
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 16/23] io-controller: deadline " Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios Vivek Goyal
                     ` (13 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, as long as no
other cgroup has been created, AS behavior should remain the same as before.

o AS is a single queue ioscheduler, which means there is one AS queue per
  group.

o Common layer code selects the queue to dispatch from based on fairness, and
  then AS code selects the request within that group (see the sketch after
  this list).

o AS runs read and write batches within a group. So the common layer runs
  timed group queues and, within a group's time, AS runs timed batches of
  reads and writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
  a W->R batch data direction switch took place and the first request from the
  read batch completed.

  Now the write batch update takes place when the last request from the write
  batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
  also does anticipation on the group if the think time of the group is within
  slice_idle.

o Introduced a few debugging messages in AS.
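
To make the division of labour above concrete, here is a condensed,
illustrative sketch of the queue-switch handshake this patch adds. The wrapper
function try_switch_active_queue() and its trimming are ours; the two helpers
it calls are real and appear in the diff below.

static void try_switch_active_queue(struct request_queue *q, int slice_expired)
{
	/*
	 * elv_iosched_expire_ioq() ends up in as_expire_ioq(): AS either
	 * saves its batch context (time left, batch direction) and agrees,
	 * or refuses because a batch changeover, outstanding dispatches or
	 * anticipation make a switch unsafe, and sets ad->switch_queue.
	 */
	if (elv_iosched_expire_ioq(q, slice_expired, 0)) {
		/* account the used slice and let the core pick a new queue */
		elv_slice_expired(q);
	}
	/*
	 * On refusal the core keeps dispatching from the same queue; the
	 * queue is marked must_expire and as_completed_request() kicks the
	 * dispatcher once the current batch has drained.
	 */
}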

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   12 ++
 block/as-iosched.c       |  295 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   89 ++++++++++++--
 block/elevator-fq.h      |    2 +
 include/linux/elevator.h |    2 +
 5 files changed, 382 insertions(+), 18 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index cc87c87..8ab08da 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 6d2468b..2a9cd06 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,8 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
+#include "elevator-fq.h"
 
 /*
  * See Documentation/block/as-iosched.txt
@@ -77,6 +79,7 @@ enum anticipation_status {
 };
 
 struct as_queue {
+	struct io_queue *ioq;
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -84,10 +87,24 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
 	int write_batch_idled;		/* has the write batch gone idle? */
+	int nr_queued[2];
 };
 
 struct as_data {
@@ -123,6 +140,9 @@ struct as_data {
 	unsigned long fifo_expire[2];
 	unsigned long batch_expire[2];
 	unsigned long antic_expire;
+
+	/* elevator requested a queue switch. */
+	int switch_queue;
 };
 
 /*
@@ -144,12 +164,185 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...)				\
+{									\
+	blk_add_trace_msg((ad)->q, "as %s " fmt,			\
+			ioq_to_io_group((asq)->ioq)->path, ##args);	\
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
+#define as_log(ad, fmt, args...)        \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * has been signalled but we are waiting for all the
+		 * requests from the previous batch to finish before starting
+		 * the new batch. Can't wait now. Mark that the full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		goto out;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		goto out;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+
+	if (ad->io_context) {
+		put_io_context(ad->io_context);
+		ad->io_context = NULL;
+	}
+
+out:
+	as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
+			" new_batch=%d, antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			ad->changed_batch, ad->new_batch, ad->antic_status);
+	return;
+}
+
+/*
+ * FIXME: In the original AS, a read batch's time accounting started only
+ * after the first request had completed (if the last batch was a write
+ * batch). But here we might be rescheduling a read batch right away,
+ * irrespective of the disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+	as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+			" writes_q=%d ad->antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			asq->nr_queued[1], asq->nr_queued[0],
+			ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from the common layer that it wishes to expire this
+ * io queue. AS decides whether the queue can be expired; if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+			slice_expired, force);
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		/*
+		 * antic_stop() sets antic_status to FINISHED which signifies
+		 * that either we timed out or we found a close request but
+		 * that's not the case here. Start from scratch.
+		 */
+		ad->antic_status = ANTIC_OFF;
+		as_save_batch_context(ad, asq);
+		ad->switch_queue = 0;
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from the existing batch to finish before we
+	 * switch the queue. The new queue might change the batch direction,
+	 * and this is to be consistent with the AS philosophy of not
+	 * dispatching new requests to the underlying drive till requests
+	 * from the previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, wait for it to finish.
+	 */
+	BUG_ON(status == ANTIC_WAIT_REQ);
+
+	if (status == ANTIC_WAIT_NEXT)
+		goto keep_queue;
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	ad->switch_queue = 0;
+	return 1;
+
+keep_queue:
+	/* Mark that the elevator requested a queue switch; do it when possible */
+	ad->switch_queue = 1;
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -429,6 +622,7 @@ static void as_antic_waitnext(struct as_data *ad)
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_log(ad, "antic_waitnext set");
 }
 
 /*
@@ -442,8 +636,10 @@ static void as_antic_waitreq(struct as_data *ad)
 	if (ad->antic_status == ANTIC_OFF) {
 		if (!ad->io_context || ad->ioc_finished)
 			as_antic_waitnext(ad);
-		else
+		else {
 			ad->antic_status = ANTIC_WAIT_REQ;
+			as_log(ad, "antic_waitreq set");
+		}
 	}
 }
 
@@ -455,6 +651,8 @@ static void as_antic_stop(struct as_data *ad)
 {
 	int status = ad->antic_status;
 
+	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
@@ -474,6 +672,7 @@ static void as_antic_timeout(unsigned long data)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	as_log(ad, "as_antic_timeout");
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -652,6 +851,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
 	struct io_context *ioc;
 	struct as_io_context *aic;
 
+#ifdef CONFIG_IOSCHED_AS_HIER
+	/*
+	 * If the active asq and rq's asq are not same, then one can not
+	 * break the anticipation. This primarily becomes useful when a
+	 * request is added to a queue which is not being served currently.
+	 */
+	if (rq) {
+		struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+		struct as_queue *curr_asq =
+				elv_active_sched_queue(ad->q->elevator);
+
+		if (asq != curr_asq)
+			return 0;
+	}
+#endif
 	ioc = ad->io_context;
 	BUG_ON(!ioc);
 	spin_lock(&ioc->lock);
@@ -810,16 +1024,21 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 /*
  * Gathers timings and resizes the write batch automatically
  */
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
-	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
+	as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+			" write_batch_idled=%d current_write_count=%d",
+			write_time, batch, asq->write_batch_idled,
+			asq->current_write_count);
+
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
 			asq->write_batch_count /= 2;
@@ -834,6 +1053,8 @@ static void update_write_batch(struct as_data *ad)
 
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
+
+	as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -843,6 +1064,7 @@ static void update_write_batch(struct as_data *ad)
 static void as_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(!list_empty(&rq->queuelist));
 
@@ -851,7 +1073,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
+	as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+		" new_batch=%d switch_queue=%d, dir=%c",
+		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+		ad->new_batch, ad->switch_queue,
+		ad->batch_data_dir ? 'R' : 'W');
+
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		/*
+		 * If this was write batch finishing, adjust the write batch
+		 * length.
+		 *
+		 * Note, write batch length is being calculated upon completion
+		 * of last write request finished and not completion of first
+		 * read request finished in the next batch.
+		 */
+		if (ad->batch_data_dir == BLK_RW_SYNC)
+			update_write_batch(ad, rq);
+
 		ad->current_batch_expires = jiffies +
 					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1108,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
-		update_write_batch(ad);
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -888,6 +1126,13 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	as_put_io_context(rq);
+
+	/*
+	 * If the elevator requested a queue switch, kick the queue in the
+	 * hope that this is the right time for the switch.
+	 */
+	if (ad->switch_queue)
+		kblockd_schedule_work(q, &ad->antic_work);
 out:
 	RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
 }
@@ -908,6 +1153,9 @@ static void as_remove_queued_request(struct request_queue *q,
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
+	BUG_ON(asq->nr_queued[data_dir] <= 0);
+	asq->nr_queued[data_dir]--;
+
 	ioc = RQ_IOC(rq);
 	if (ioc && ioc->aic) {
 		BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1267,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
+	as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
+			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
 /*
@@ -1066,6 +1316,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
+		as_log_asq(ad, asq, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1078,8 +1329,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	if (!(reads || writes)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
-		|| ad->changed_batch)
+		|| ad->changed_batch) {
+		as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
+			" ad->antic_status=%d, changed_batch=%d,"
+			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
+			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+			ad->switch_queue, ad->new_batch);
 		return 0;
+	}
 
 	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
@@ -1092,6 +1349,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
+				as_log_asq(ad, asq, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1111,6 +1369,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
+	as_log_asq(ad, asq, "select a fresh batch and request");
+
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
@@ -1125,6 +1385,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
+		as_log_asq(ad, asq, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1149,6 +1410,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
+		as_log_asq(ad, asq, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1446,9 @@ fifo_expired:
 		ad->changed_batch = 0;
 	}
 
+	if (ad->switch_queue)
+		return 0;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -1207,6 +1472,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 
 	rq->elevator_private = as_get_io_context(q->node);
 
+	asq->nr_queued[data_dir]++;
+	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
+			data_dir ? 'R' : 'W', asq->nr_queued[1],
+			asq->nr_queued[0]);
+
 	if (RQ_IOC(rq)) {
 		as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
 		atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1358,6 +1628,7 @@ static void *as_alloc_as_queue(struct request_queue *q,
 
 	if (asq->write_batch_count < 2)
 		asq->write_batch_count = 2;
+	asq->ioq = ioq;
 out:
 	return asq;
 }
@@ -1408,6 +1679,7 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+	ad->switch_queue = 0;
 
 	return ad;
 }
@@ -1493,6 +1765,11 @@ static struct elv_fs_entry as_attrs[] = {
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
@@ -1514,8 +1791,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0289fff..a14fa72 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1922,6 +1922,7 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 		elv_clear_ioq_must_dispatch(ioq);
 		elv_clear_iog_wait_busy_done(iog);
 		elv_mark_ioq_slice_new(ioq);
+		elv_clear_ioq_must_expire(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
 	}
@@ -1995,6 +1996,46 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
 }
 
 /*
+ * Let the iosched know that the elevator wants to expire the queue. This
+ * gives an iosched like AS a chance to say no (if it is in the middle of a
+ * batch changeover or anticipating). It also lets the iosched do housekeeping.
+ *
+ * If force = 1, it is a forced dispatch and the iosched must clean up its
+ * state. This is useful when the elevator wants to drain the iosched and
+ * expire the current active queue.
+ * If slice_expired = 1, the ioq slice has expired and hence the elevator
+ * fair queuing logic wants to switch the queue. The iosched should allow
+ * that unless it really cannot; currently AS can deny the switch if it is
+ * in the middle of a batch switch.
+ *
+ * If slice_expired = 0, the time slice is still remaining. It is up to the
+ * iosched whether it wants to keep this queue or expire it and move on.
+ */
+static int
+elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int ret = 1;
+
+	if (e->ops->elevator_expire_ioq_fn) {
+		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+		/*
+		 * AS denied expiration of the queue right now. Mark that the
+		 * elevator layer has requested the ioscheduler (AS) to expire
+		 * this queue, so AS will expire it as soon as it can. Don't
+		 * keep dispatching from this queue even if a new request
+		 * arrives and time slice is left; expire it at the first chance.
+		 */
+		if (!ret)
+			elv_mark_ioq_must_expire(ioq);
+	}
+
+	return ret;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -2032,6 +2073,7 @@ void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	elv_clear_iog_wait_request(iog);
 	elv_clear_iog_wait_busy(iog);
 	elv_clear_iog_wait_busy_done(iog);
+	elv_clear_ioq_must_expire(ioq);
 
 	/*
 	 * Queue got expired before even a single request completed or
@@ -2157,16 +2199,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
-	elv_log_ioq(q->elevator->efqd, ioq, "preempt");
-	elv_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+		elv_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	requeue_ioq(ioq, 1);
-	elv_mark_ioq_slice_new(ioq);
+		requeue_ioq(ioq, 1);
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2364,6 +2408,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
+	struct elevator_type *e = q->elevator->elevator_type;
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2384,6 +2430,10 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* This queue has been marked for expiry. Try to expire it */
+	if (elv_ioq_must_expire(ioq))
+		goto expire;
+
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS).
@@ -2470,19 +2520,32 @@ void *elv_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	if (efqd->fairness && !force && ioq && ioq->dispatched) {
+	if (efqd->fairness && !force && ioq && ioq->dispatched
+	    && strcmp(e->elevator_name, "anticipatory")) {
 		/*
 		 * If there are request dispatched from this queue, don't
 		 * dispatch requests from new queue till all the requests from
 		 * this queue have completed.
+		 *
+		 * Anticipatory does not allow queue switch until requests
+		 * from previous queue have finished.
 		 */
 		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
 				" disp=%lu", ioq->dispatched);
 		ioq = NULL;
 		goto keep_queue;
 	}
-	elv_slice_expired(q);
+
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_slice_expired(q);
+	else
+		/*
+		 * Not making ioq = NULL, as AS can deny queue expiration and
+		 * continue to dispatch from same queue
+		 */
+		goto keep_queue;
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -2615,8 +2678,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 				goto done;
 
 			/* Expire the queue */
-			elv_slice_expired(q);
-			goto done;
+			if (elv_iosched_expire_ioq(q, 1, 0)) {
+				elv_slice_expired(q);
+				goto done;
+			}
 		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
 			 && sync && !rq_noidle(rq))
 			elv_ioq_arm_slice_timer(q);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a63308b..95ed680 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -229,6 +229,7 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_must_expire,       /* expire queue even slice is left */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -251,6 +252,7 @@ ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)
 
 #ifdef CONFIG_GROUP_IOSCHED
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 77c1fa5..3d4e31c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -41,6 +41,7 @@ typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -79,6 +80,7 @@ struct elevator_ops
 	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 17/23] io-controller: anticipatory changes for hierarchical fair queuing
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS behavior by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, as long as no other
cgroup has been created, AS behavior should remain the same as before.

o AS is a single-queue IO scheduler, which means there is one AS queue per group.

o Common layer code selects the queue to dispatch from based on fairness, and
  then AS code selects the request within the group (a sketch of the queue
  expiry handshake between the two layers follows this list).

o AS runs read and write batches within a group. So the common layer runs
  timed group queues, and within a group's time slice, AS runs timed batches
  of reads and writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
  a W->R batch data direction switch took place and the first request from the
  read batch completed.

  Now the write batch update takes place when the last request from the write
  batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
  also anticipates on the group if the think time of the group is within
  slice_idle.

o Introduced a few debugging messages in AS.
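
The handshake mentioned above is carried by the new elevator_expire_ioq_fn hook
that this patch adds to elevator_ops. The snippet below is only an illustrative
sketch, not part of the patch: it shows how a hypothetical scheduler (here
called "foo") with no batching or anticipation constraints could implement the
hook.

/*
 * Sketch only: a scheduler that never needs to delay a queue switch
 * simply accepts every expiry request from the fair queuing layer.
 * Returning 1 lets the common layer go ahead with elv_slice_expired();
 * returning 0 keeps the active queue for now (as AS does while a batch
 * is draining or anticipation is in progress).
 */
static int foo_expire_ioq(struct request_queue *q, void *sched_queue,
			  int slice_expired, int force)
{
	/* Nothing to save and no reason to refuse, even on forced expiry. */
	return 1;
}

AS, by contrast, returns 0 from as_expire_ioq() while a batch changeover is
pending, while dispatched requests from the current batch are outstanding, or
while it is anticipating (ANTIC_WAIT_NEXT), and it sets ad->switch_queue so the
switch is retried from the completion path.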

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 ++
 block/as-iosched.c       |  295 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   89 ++++++++++++--
 block/elevator-fq.h      |    2 +
 include/linux/elevator.h |    2 +
 5 files changed, 382 insertions(+), 18 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index cc87c87..8ab08da 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 6d2468b..2a9cd06 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,8 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
+#include "elevator-fq.h"
 
 /*
  * See Documentation/block/as-iosched.txt
@@ -77,6 +79,7 @@ enum anticipation_status {
 };
 
 struct as_queue {
+	struct io_queue *ioq;
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -84,10 +87,24 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
 	int write_batch_idled;		/* has the write batch gone idle? */
+	int nr_queued[2];
 };
 
 struct as_data {
@@ -123,6 +140,9 @@ struct as_data {
 	unsigned long fifo_expire[2];
 	unsigned long batch_expire[2];
 	unsigned long antic_expire;
+
+	/* elevator requested a queue switch. */
+	int switch_queue;
 };
 
 /*
@@ -144,12 +164,185 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...)				\
+{									\
+	blk_add_trace_msg((ad)->q, "as %s " fmt,			\
+			ioq_to_io_group((asq)->ioq)->path, ##args);	\
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
+#define as_log(ad, fmt, args...)        \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * requests to finish from the previous batch and then start
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		goto out;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		goto out;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+
+	if (ad->io_context) {
+		put_io_context(ad->io_context);
+		ad->io_context = NULL;
+	}
+
+out:
+	as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
+			" new_batch=%d, antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			ad->changed_batch, ad->new_batch, ad->antic_status);
+	return;
+}
+
+/*
+ * FIXME: In the original AS, a read batch's time accounting started only
+ * after the first request had completed (if the last batch was a write
+ * batch). But here we might be rescheduling a read batch right away,
+ * irrespective of the disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+	as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+			" writes_q=%d ad->antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			asq->nr_queued[1], asq->nr_queued[0],
+			ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+			slice_expired, force);
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		/*
+		 * antic_stop() sets antic_status to FINISHED which signifies
+		 * that either we timed out or we found a close request but
+		 * that's not the case here. Start from scratch.
+		 */
+		ad->antic_status = ANTIC_OFF;
+		as_save_batch_context(ad, asq);
+		ad->switch_queue = 0;
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with AS philosophy of not dispatching
+	 * new requests to the underlying drive till requests from the
+	 * previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, wait for it to finish.
+	 */
+	BUG_ON(status == ANTIC_WAIT_REQ);
+
+	if (status == ANTIC_WAIT_NEXT)
+		goto keep_queue;
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	ad->switch_queue = 0;
+	return 1;
+
+keep_queue:
+	/* Mark that elevator requested for queue switch whenever possible */
+	ad->switch_queue = 1;
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -429,6 +622,7 @@ static void as_antic_waitnext(struct as_data *ad)
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_log(ad, "antic_waitnext set");
 }
 
 /*
@@ -442,8 +636,10 @@ static void as_antic_waitreq(struct as_data *ad)
 	if (ad->antic_status == ANTIC_OFF) {
 		if (!ad->io_context || ad->ioc_finished)
 			as_antic_waitnext(ad);
-		else
+		else {
 			ad->antic_status = ANTIC_WAIT_REQ;
+			as_log(ad, "antic_waitreq set");
+		}
 	}
 }
 
@@ -455,6 +651,8 @@ static void as_antic_stop(struct as_data *ad)
 {
 	int status = ad->antic_status;
 
+	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
@@ -474,6 +672,7 @@ static void as_antic_timeout(unsigned long data)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	as_log(ad, "as_antic_timeout");
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -652,6 +851,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
 	struct io_context *ioc;
 	struct as_io_context *aic;
 
+#ifdef CONFIG_IOSCHED_AS_HIER
+	/*
+	 * If the active asq and rq's asq are not same, then one can not
+	 * break the anticipation. This primarily becomes useful when a
+	 * request is added to a queue which is not being served currently.
+	 */
+	if (rq) {
+		struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+		struct as_queue *curr_asq =
+				elv_active_sched_queue(ad->q->elevator);
+
+		if (asq != curr_asq)
+			return 0;
+	}
+#endif
 	ioc = ad->io_context;
 	BUG_ON(!ioc);
 	spin_lock(&ioc->lock);
@@ -810,16 +1024,21 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 /*
  * Gathers timings and resizes the write batch automatically
  */
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
-	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
+	as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+			" write_batch_idled=%d current_write_count=%d",
+			write_time, batch, asq->write_batch_idled,
+			asq->current_write_count);
+
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
 			asq->write_batch_count /= 2;
@@ -834,6 +1053,8 @@ static void update_write_batch(struct as_data *ad)
 
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
+
+	as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -843,6 +1064,7 @@ static void update_write_batch(struct as_data *ad)
 static void as_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(!list_empty(&rq->queuelist));
 
@@ -851,7 +1073,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
+	as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+		" new_batch=%d switch_queue=%d, dir=%c",
+		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+		ad->new_batch, ad->switch_queue,
+		ad->batch_data_dir ? 'R' : 'W');
+
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		/*
+		 * If this was write batch finishing, adjust the write batch
+		 * length.
+		 *
+		 * Note, write batch length is being calculated upon completion
+		 * of last write request finished and not completion of first
+		 * read request finished in the next batch.
+		 */
+		if (ad->batch_data_dir == BLK_RW_SYNC)
+			update_write_batch(ad, rq);
+
 		ad->current_batch_expires = jiffies +
 					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1108,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
-		update_write_batch(ad);
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -888,6 +1126,13 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	as_put_io_context(rq);
+
+	/*
+	 * If elevator requested a queue switch, kick the queue in the
+	 * hope that this is right time for switch.
+	 */
+	if (ad->switch_queue)
+		kblockd_schedule_work(q, &ad->antic_work);
 out:
 	RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
 }
@@ -908,6 +1153,9 @@ static void as_remove_queued_request(struct request_queue *q,
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
+	BUG_ON(asq->nr_queued[data_dir] <= 0);
+	asq->nr_queued[data_dir]--;
+
 	ioc = RQ_IOC(rq);
 	if (ioc && ioc->aic) {
 		BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1267,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
+	as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
+			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
 /*
@@ -1066,6 +1316,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
+		as_log_asq(ad, asq, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1078,8 +1329,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	if (!(reads || writes)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
-		|| ad->changed_batch)
+		|| ad->changed_batch) {
+		as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
+			" ad->antic_status=%d, changed_batch=%d,"
+			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
+			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+			ad->switch_queue, ad->new_batch);
 		return 0;
+	}
 
 	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
@@ -1092,6 +1349,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
+				as_log_asq(ad, asq, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1111,6 +1369,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
+	as_log_asq(ad, asq, "select a fresh batch and request");
+
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
@@ -1125,6 +1385,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
+		as_log_asq(ad, asq, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1149,6 +1410,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
+		as_log_asq(ad, asq, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1446,9 @@ fifo_expired:
 		ad->changed_batch = 0;
 	}
 
+	if (ad->switch_queue)
+		return 0;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -1207,6 +1472,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 
 	rq->elevator_private = as_get_io_context(q->node);
 
+	asq->nr_queued[data_dir]++;
+	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
+			data_dir ? 'R' : 'W', asq->nr_queued[1],
+			asq->nr_queued[0]);
+
 	if (RQ_IOC(rq)) {
 		as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
 		atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1358,6 +1628,7 @@ static void *as_alloc_as_queue(struct request_queue *q,
 
 	if (asq->write_batch_count < 2)
 		asq->write_batch_count = 2;
+	asq->ioq = ioq;
 out:
 	return asq;
 }
@@ -1408,6 +1679,7 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+	ad->switch_queue = 0;
 
 	return ad;
 }
@@ -1493,6 +1765,11 @@ static struct elv_fs_entry as_attrs[] = {
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(group_idle),
+#endif
 	__ATTR_NULL
 };
 
@@ -1514,8 +1791,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0289fff..a14fa72 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1922,6 +1922,7 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 		elv_clear_ioq_must_dispatch(ioq);
 		elv_clear_iog_wait_busy_done(iog);
 		elv_mark_ioq_slice_new(ioq);
+		elv_clear_ioq_must_expire(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
 	}
@@ -1995,6 +1996,46 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
 }
 
 /*
+ * Call into the iosched to tell it that the elevator wants to expire the
+ * queue. This gives an iosched like AS a chance to say no (e.g. if it is in
+ * the middle of a batch changeover or anticipating) and to do housekeeping.
+ *
+ * If force = 1, it is a forced dispatch and the iosched must clean up its
+ * state. This is useful when the elevator wants to drain the iosched and
+ * expire the current active queue.
+ * If slice_expired = 1, the ioq slice has expired, hence the elevator fair
+ * queuing logic wants to switch the queue. The iosched should allow that
+ * unless it really must keep the queue; currently AS can deny the switch in
+ * the middle of a batch switch.
+ *
+ * If slice_expired = 0, the time slice is still remaining. It is up to the
+ * iosched whether it waits on this queue or expires it and moves on.
+ */
+static int
+elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int ret = 1;
+
+	if (e->ops->elevator_expire_ioq_fn) {
+		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+		/*
+		 * The iosched (AS) denied expiration of the queue right now.
+		 * Mark that the elevator layer has asked the ioscheduler to
+		 * expire this queue, so it will do that as soon as it can.
+		 * Meanwhile don't dispatch from this queue even if a new
+		 * request arrives and time slice is left; expire it first.
+		 */
+		if (!ret)
+			elv_mark_ioq_must_expire(ioq);
+	}
+
+	return ret;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -2032,6 +2073,7 @@ void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	elv_clear_iog_wait_request(iog);
 	elv_clear_iog_wait_busy(iog);
 	elv_clear_iog_wait_busy_done(iog);
+	elv_clear_ioq_must_expire(ioq);
 
 	/*
 	 * Queue got expired before even a single request completed or
@@ -2157,16 +2199,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
-	elv_log_ioq(q->elevator->efqd, ioq, "preempt");
-	elv_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+		elv_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	requeue_ioq(ioq, 1);
-	elv_mark_ioq_slice_new(ioq);
+		requeue_ioq(ioq, 1);
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2364,6 +2408,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
+	struct elevator_type *e = q->elevator->elevator_type;
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2384,6 +2430,10 @@ void *elv_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* This queue has been marked for expiry. Try to expire it */
+	if (elv_ioq_must_expire(ioq))
+		goto expire;
+
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS).
@@ -2470,19 +2520,32 @@ void *elv_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	if (efqd->fairness && !force && ioq && ioq->dispatched) {
+	if (efqd->fairness && !force && ioq && ioq->dispatched
+	    && strcmp(e->elevator_name, "anticipatory")) {
 		/*
 		 * If there are request dispatched from this queue, don't
 		 * dispatch requests from new queue till all the requests from
 		 * this queue have completed.
+		 *
+		 * Anticipatory does not allow queue switch until requests
+		 * from previous queue have finished.
 		 */
 		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
 				" disp=%lu", ioq->dispatched);
 		ioq = NULL;
 		goto keep_queue;
 	}
-	elv_slice_expired(q);
+
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_slice_expired(q);
+	else
+		/*
+		 * Not making ioq = NULL, as AS can deny queue expiration and
+		 * continue to dispatch from same queue
+		 */
+		goto keep_queue;
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -2615,8 +2678,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 				goto done;
 
 			/* Expire the queue */
-			elv_slice_expired(q);
-			goto done;
+			if (elv_iosched_expire_ioq(q, 1, 0)) {
+				elv_slice_expired(q);
+				goto done;
+			}
 		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
 			 && sync && !rq_noidle(rq))
 			elv_ioq_arm_slice_timer(q);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a63308b..95ed680 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -229,6 +229,7 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_must_expire,       /* expire queue even if slice is left */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -251,6 +252,7 @@ ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)
 
 #ifdef CONFIG_GROUP_IOSCHED
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 77c1fa5..3d4e31c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -41,6 +41,7 @@ typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -79,6 +80,7 @@ struct elevator_ops
 	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 17/23] io-controller: anticipatory " Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 19/23] io-controller: map async requests to appropriate cgroup Vivek Goyal
                     ` (12 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o blkio_cgroup patches from Ryo to track async bios.

o This functionality is used to determine the group of an async IO from the
  page itself instead of from the context of the submitting task (a usage
  sketch follows below).
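
The sketch below illustrates how the interface exported by
include/linux/biotrack.h (see the hunks that follow) is meant to be used. It is
not part of the patch; both functions and their names are hypothetical
simplifications of the real call sites.

#include <linux/bio.h>
#include <linux/biotrack.h>
#include <linux/sched.h>

/* 1. When a page enters the page cache or is dirtied, record which
 *    cgroup the mm doing the IO belongs to in the page's page_cgroup.
 */
static void example_record_owner(struct page *page)
{
	blkio_cgroup_set_owner(page, current->mm);
}

/* 2. When writeback later builds a bio for that page, the block layer
 *    classifies the bio by the recorded owner rather than by "current".
 *    The stubs below return 0 when CONFIG_CGROUP_BLKIO is off.
 */
static unsigned long example_classify_bio(struct bio *bio)
{
	return get_blkio_cgroup_id(bio);
}

In the patch itself, the hooks are wired into __set_page_dirty() (via
blkio_cgroup_reset_owner_pagedirty()) and into the direct IO path (via
blkio_cgroup_reset_owner()), as the fs/buffer.c and fs/direct-io.c hunks below
show.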

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-ioc.c               |   36 +++---
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |  100 ++++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |    5 +-
 init/Kconfig                  |   16 +++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  293 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   23 ++--
 mm/swap_state.c               |    2 +
 19 files changed, 486 insertions(+), 31 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0d56336..890d475 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,31 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index 28f320f..8efcd82 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..185ba0a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..2b8bb0b
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,100 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	pc->blkio_cgroup_id = 0;
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index baf544f..78504f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b343594..1baa6c1 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..eb45fe9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..c9d1ed4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 13f126c..bca6c8a 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -14,6 +14,7 @@ struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+	unsigned long blkio_cgroup_id;
 	struct list_head lru;		/* per cgroup LRU list */
 };
 
@@ -83,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index afcaa86..54aa85a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,22 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	default n
+	---help---
+	  Provides a Resource Controller which enables tracking of the owner
+	  of every Block I/O request.
+	  The information this subsystem provides can be used from any
+	  kind of module such as dm-ioband device mapper modules or
+	  the cfq-scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..1da7d1e
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,293 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page, and the owner of that page
+ * can be easily determined within the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Tag a given page with the blkio-cgroup ID of its owner.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	pc->blkio_cgroup_id = 0;	/* 0: default blkio_cgroup id */
+	if (!mm)
+		return;
+	/*
+	 * Locking "pc" isn't necessary here since the current process is
+	 * the only one that can access the members related to blkio_cgroup.
+	 */
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog))
+		goto out;
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog", so pc->blkio_cgroup_id
+	 * may become invalid even while this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	pc->blkio_cgroup_id = css_id(&biog->css);
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	/*
+	 * A little trick:
+	 * Just call blkio_cgroup_set_owner() for pages which are already
+	 * active since the blkio_cgroup_id member of page_cgroup can be
+	 * updated without any locks. This is because a word-sized integer
+	 * can be stored atomically in a single write on modern CPUs.
+	 */
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page to which we want to copy the owner ID
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	/*
+	 * Do this without any locks. The reason is the same as
+	 * blkio_cgroup_reset_owner().
+	 */
+	npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc)
+		id = pc->blkio_cgroup_id;
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page:	the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that
+ * the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct cgroup_subsys_state *css;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, pc->blkio_cgroup_id);
+	if (!css)
+		return NULL;
+
+	return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	return (u64)css_id(&biog->css);
+}
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..7ad8d44 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/biotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..01c47a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..baf4be7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index aede2ce..346f368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..1df421b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..29bf26c 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -245,7 +247,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -254,14 +256,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.6
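
As a usage note (illustrative, not from the patch): once pages carry a
blkio_cgroup_id, any code holding a struct page can map it back to the
owning cgroup with get_cgroup_from_page(), which must be called under
rcu_read_lock() as its kerneldoc above says. The helper name below is
hypothetical; the interface is the one added by this patch.

#include <linux/types.h>
#include <linux/rcupdate.h>
#include <linux/biotrack.h>

/* Illustrative only: check whether a page is owned by the default group. */
static bool page_in_default_blkio_cgroup(struct page *page)
{
	struct cgroup *cgrp;

	rcu_read_lock();
	cgrp = get_cgroup_from_page(page);
	rcu_read_unlock();

	/* NULL means the page belongs to default_blkio_cgroup */
	return cgrp == NULL;
}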

^ permalink raw reply related	[flat|nested] 322+ messages in thread

 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index afcaa86..54aa85a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,22 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	default n
+	---help---
+	  Provides a resource controller which makes it possible to track
+	  the owner of every block I/O request.
+	  The information this subsystem provides can be used from any
+	  kind of module such as dm-ioband device mapper modules or
+	  the cfq-scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..1da7d1e
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,293 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	pc->blkio_cgroup_id = 0;	/* 0: default blkio_cgroup id */
+	if (!mm)
+		return;
+	/*
+	 * Locking "pc" isn't necessary here since the current process is
+	 * the only one that can access the members related to blkio_cgroup.
+	 */
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog))
+		goto out;
+	/*
+	 * css_get(&bio->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog" so pc->blkio_cgroup_id
+	 * might turn invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	pc->blkio_cgroup_id = css_id(&biog->css);
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	/*
+	 * A little trick:
+	 * Just call blkio_cgroup_set_owner() for pages which are already
+	 * active since the blkio_cgroup_id member of page_cgroup can be
+	 * updated without any locks. This is because an integer variable
+	 * can be assigned a new value in a single store on modern cpus.
+	 */
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	/*
+	 * Do this without any locks. The reason is the same as
+	 * blkio_cgroup_reset_owner().
+	 */
+	npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc)
+		id = pc->blkio_cgroup_id;
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page:	the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that
+ * the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct cgroup_subsys_state *css;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, pc->blkio_cgroup_id);
+	if (!css)
+		return NULL;
+
+	return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	return (u64)css_id(&biog->css);
+}
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..7ad8d44 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/biotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..01c47a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..baf4be7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index aede2ce..346f368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..1df421b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..29bf26c 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -245,7 +247,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -254,14 +256,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread
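
The sketch below illustrates how a block-layer consumer could use the tracking
interface added by the patch above to find the owner of a bio. Only
get_blkio_cgroup_id(), get_blkio_cgroup_iocontext() and blkio_cgroup_disabled()
come from this patch; bio_owner_ioc() is just an illustrative helper name, not
something the patch adds.

#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/iocontext.h>
#include <linux/biotrack.h>

/* Resolve the io_context of the cgroup that owns the page behind a bio. */
static struct io_context *bio_owner_ioc(struct bio *bio)
{
	if (blkio_cgroup_disabled())
		return NULL;

	/* css id of the owning blkio cgroup; 0 means the default group */
	pr_debug("bio owner blkio cgroup id %lu\n", get_blkio_cgroup_id(bio));

	/*
	 * The io_context of the owning cgroup is returned with its
	 * reference count already elevated, so the caller must drop it
	 * with put_io_context() when done.
	 */
	return get_blkio_cgroup_iocontext(bio);
}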

* [PATCH 19/23] io-controller: map async requests to appropriate cgroup
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (17 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 20/23] io-controller: Per cgroup request descriptor support Vivek Goyal
                     ` (11 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o So far we were assuming that a bio/rq belongs to the task that is submitting
  it. That assumption does not hold good for async writes. This patch makes
  use of the blkio_cgroup patches to attribute async writes to the right group
  instead of to the task submitting the bio.

o For sync requests, we continue to assume that the io belongs to the task
  submitting it. Only in the case of async requests do we make use of the io
  tracking patches to track the owner cgroup.

o So far cfq always caches the async queue pointer. With async requests now
  not necessarily being tied to the submitting task's io context, caching the
  pointer will not help for async queues. This patch introduces a new config
  option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior where the async queue pointer is cached in the task context.
  If it is set, the async queue pointer is not cached and we take the help of
  the bio tracking patches to determine which group a bio belongs to and then
  map it to the async queue of that group (a sketch of this lookup follows
  below).
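
With CONFIG_TRACK_ASYNC_CONTEXT enabled, the async mapping condenses to the
lookup sketched here. This is only an illustration of the logic of
cic_bio_to_cfqq() in the patch below; async_cfqq_for_bio() is an illustrative
name rather than a function the patch adds.

static struct cfq_queue *async_cfqq_for_bio(struct cfq_data *cfqd,
					    struct cfq_io_context *cic,
					    struct bio *bio)
{
	/* prio and class still come from the submitting task's io context */
	const int ioprio = task_ioprio(cic->ioc);
	const int ioprio_class = task_ioprio_class(cic->ioc);
	struct io_group *iog;

	/* bio -> first page -> page_cgroup -> blkio cgroup -> io group */
	iog = elv_io_get_io_group_bio(cfqd->queue, bio, 0);
	if (!iog)
		return NULL;	/* owning group not set up yet */

	/* async queues live in the io group, indexed by (class, prio) */
	return elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
}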

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  152 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |   91 +++++++++++++++++++++++----
 block/elevator-fq.h      |   31 ++++++---
 block/elevator.c         |   15 +++--
 include/linux/elevator.h |   22 ++++++-
 9 files changed, 266 insertions(+), 72 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8ab08da..8b507c4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -132,6 +132,22 @@ config DEBUG_GROUP_IOSCHED
 	  Enable some debugging hooks for hierarchical scheduling support.
 	  Currently it just outputs more information in blktrace output.
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group io scheduling, for accurate accounting of
+	  async writes, one needs to map the request to the original
+	  task/cgroup which originated it, not to the submitter of the request.
+
+	  Currently there are generic io tracking patches which provide the
+	  facility to map a bio to its original owner. If this option is set,
+	  the original owner of an async bio is determined using the io
+	  tracking patches; otherwise we continue to attribute the request to
+	  the submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 2a9cd06..8ea9398 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1515,7 +1515,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..47cce59 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -619,7 +619,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -631,7 +632,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -772,7 +773,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 52c4710..034b5ca 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -176,8 +176,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -187,22 +187,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go in. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, the task information could be saved in
+ * the page_cgroup and the task's ioprio and class retrieved from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = elv_io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		return elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -526,7 +560,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -609,7 +643,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -620,7 +654,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1244,14 +1278,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to old queue unconditionally. Don't
+		 * worry whether new async prio queue has been allocated
+		 * or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why allocate a new queue now? Will it not be allocated
+		 * automatically whenever another async request from the same
+		 * context comes? Keeping it for the time being because the
+		 * existing cfq code allocates the new queue immediately upon
+		 * prio change.
+		 */
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1302,7 +1350,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1340,7 +1388,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1350,12 +1398,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_group *iog = NULL;
 
 retry:
-	iog = elv_io_get_io_group(q, 1);
+	iog = elv_io_get_io_group_bio(q, bio, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		cfqq = elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+
 	/*
 	 * Always try a new alloc if we fell back to the OOM cfqq
 	 * originally, since it should just be a temporary situation.
@@ -1432,14 +1496,14 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+		struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 1);
+	struct io_group *iog = elv_io_get_io_group_bio(cfqd->queue, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
@@ -1448,14 +1512,35 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq)
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 
 	if (!is_sync && !async_cfqq)
 		elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
 							cfqq->ioq);
-
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If async request queue/group is determined from the
+	 * original task/cgroup and not from submitter task, io context can
+	 * not cache the pointer to the async queue and every time a request comes,
+	 * it will be determined by going through the async queue array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a different cgroup altogether than the cgroup
+	 * iocontext belongs to. And this thread might be submitting bios
+	 * from various cgroups. So every time async queue will be different
+	 * based on the cgroup of the bio/rq. Can't cache the async cfqq
+	 * pointer in cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1908,7 +1993,8 @@ static void cfq_put_request(struct request *rq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1928,7 +2014,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	}
 
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index e5bc823..cc9c8c3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -134,7 +134,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a14fa72..9c8783c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,7 @@
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
+#include <linux/biotrack.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -917,6 +918,9 @@ struct io_cgroup io_root_cgroup = {
 
 static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -1376,9 +1380,45 @@ end:
 	return iog;
 }
 
+struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+						struct bio *bio, int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group
+	 */
+	if (!bio)
+		goto sync;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to root group. May be more special
+		 * bio cases should come here
+		 */
+		return q->elevator->efqd->root_group;
+	}
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/* Map the sync bio to the right group using task context */
+	if (elv_bio_sync(bio))
+		goto sync;
+
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return elv_io_get_io_group(q, page, create);
+#endif
+
+sync:
+	return elv_io_get_io_group(q, page, create);
+}
+EXPORT_SYMBOL(elv_io_get_io_group_bio);
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group page belongs to.
+ * If "create" is set, io group is created if it is not already present.
  *
  * Note: This function should be called with queue lock held. It returns
  * a pointer to io group without taking any reference. That group will
@@ -1386,28 +1426,45 @@ end:
  * needs to get hold of queue lock). So if somebody needs to use group
  * pointer even after dropping queue lock, take a reference to the group
  * before dropping queue lock.
+ *
+ * One can call it without queue lock with rcu read lock held for browsing
+ * through the groups.
  */
-struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+struct io_group *
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = q->elevator->efqd;
 
-	assert_spin_locked(q->queue_lock);
+	if (create)
+		assert_spin_locked(q->queue_lock);
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
+
+	if (!page)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
+		iog = efqd->root_group;
+		goto out;
+	}
+
 	iog = io_find_alloc_group(q, cgroup, efqd, create);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
 	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
@@ -1637,7 +1694,7 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* Maybe the task belongs to a different cgroup for which the io
 		 * group has not been set up yet. */
@@ -1670,7 +1727,7 @@ elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
  * function is not invoked.
  */
 int elv_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+				struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -1686,7 +1743,7 @@ int elv_set_request_ioq(struct request_queue *q, struct request *rq,
 
 retry:
 	/* Determine the io group request belongs to */
-	iog = elv_io_get_io_group(q, 1);
+	iog = elv_io_get_io_group_bio(q, bio, 1);
 	BUG_ON(!iog);
 
 	/* Get the iosched queue */
@@ -1773,18 +1830,20 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue a bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
 	/* Determine the io group and io queue of the bio submitting task */
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
-		 * not been setup yet. */
+		/*
+		 * May be bio belongs to a cgroup for which io group has
+		 * not been setup yet.
+		 */
 		return NULL;
 	}
 	return iog->ioq;
@@ -2549,6 +2608,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 95ed680..9fe52fa 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -414,7 +414,9 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
 extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void elv_put_iog(struct io_group *iog);
 extern struct io_group *elv_io_get_io_group(struct request_queue *q,
-						int create);
+					struct page *page, int create);
+extern struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+					struct bio *bio, int create);
 extern ssize_t elv_group_idle_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_group_idle_store(struct elevator_queue *q, const char *name,
 					size_t count);
@@ -424,9 +426,10 @@ static inline void elv_get_iog(struct io_group *iog)
 }
 
 extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 #else /* !GROUP_IOSCHED */
 
@@ -439,14 +442,20 @@ static inline void elv_get_iog(struct io_group *iog) {}
 static inline void elv_put_iog(struct io_group *iog) {}
 
 static inline struct io_group *
-elv_io_get_io_group(struct request_queue *q, int create)
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
 {
 	/* In flat mode, there is only root group */
 	return q->elevator->efqd->root_group;
 }
 
-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline struct io_group *
+elv_io_get_io_group_bio(struct request_queue *q, struct bio *bio, int create)
+{
+	return q->elevator->efqd->root_group;
+}
+
+static inline int elv_set_request_ioq(struct request_queue *q,
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -454,7 +463,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 static inline void
 elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *
+elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	return NULL;
 }
@@ -553,8 +563,8 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline int elv_set_request_ioq(struct request_queue *q,
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -562,7 +572,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 static inline void
 elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index bc43edd..4ed37b6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -865,7 +865,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -874,10 +875,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_set_request_ioq(q, rq, gfp_mask);
+		return elv_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -1279,19 +1280,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group a bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
+	return elv_ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3d4e31c..0ace96e 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,7 +22,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -146,7 +147,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -275,6 +277,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, like during
+ * merging and during request allocation, where we don't have an rq but only
+ * a bio and need to find out if this bio will be considered as sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 19/23] io-controller: map async requests to appropriate cgroup
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o So far we were assuming that a bio/rq belongs to the task that is submitting
  it. That assumption does not hold good for async writes. This patch makes
  use of the blkio_cgroup patches to attribute async writes to the right group
  instead of to the task submitting the bio (a sketch of this routing decision
  follows the changelog).

o For sync requests, we continue to assume that the io belongs to the task
  submitting it. Only in the case of async requests do we make use of the io
  tracking patches to track the owner cgroup.

o So far cfq always caches the async queue pointer. With async requests now
  not necessarily being tied to the submitting task's io context, caching the
  pointer will not help for async queues. This patch introduces a new config
  option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior where the async queue pointer is cached in the task context.
  If it is set, the async queue pointer is not cached and we take the help of
  the bio tracking patches to determine which group a bio belongs to and then
  map it to the async queue of that group.
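
Per bio, the routing decision introduced here reduces to the sketch below,
which mirrors the logic of elv_io_get_io_group_bio() added by this patch
(assuming CONFIG_TRACK_ASYNC_CONTEXT is enabled); group_for_bio() is an
illustrative name only.

static struct io_group *group_for_bio(struct request_queue *q,
				      struct bio *bio, int create)
{
	/* No bio (e.g. blk_get_request()): charge the current task's group */
	if (!bio)
		return elv_io_get_io_group(q, NULL, create);

	/* Barrier bios are mapped to the root group */
	if (bio_barrier(bio))
		return q->elevator->efqd->root_group;

	/* Sync bios: the submitter owns the io, so use the task context */
	if (elv_bio_sync(bio))
		return elv_io_get_io_group(q, NULL, create);

	/* Async bios: the owner was recorded in the page when it was dirtied */
	return elv_io_get_io_group(q, bio_iovec_idx(bio, 0)->bv_page, create);
}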

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  152 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |   91 +++++++++++++++++++++++----
 block/elevator-fq.h      |   31 ++++++---
 block/elevator.c         |   15 +++--
 include/linux/elevator.h |   22 ++++++-
 9 files changed, 266 insertions(+), 72 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8ab08da..8b507c4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -132,6 +132,22 @@ config DEBUG_GROUP_IOSCHED
 	  Enable some debugging hooks for hierarchical scheduling support.
 	  Currently it just outputs more information in blktrace output.
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group io scheduling, for accurate accounting of
+	  async writes, one needs to map the request to the original
+	  task/cgroup which originated it, not to the submitter of the request.
+
+	  Currently there are generic io tracking patches which provide the
+	  facility to map a bio to its original owner. If this option is set,
+	  the original owner of an async bio is determined using the io
+	  tracking patches; otherwise we continue to attribute the request to
+	  the submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 2a9cd06..8ea9398 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1515,7 +1515,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..47cce59 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -619,7 +619,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -631,7 +632,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -772,7 +773,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 52c4710..034b5ca 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -176,8 +176,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -187,22 +187,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go in. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, the task information could be saved in
+ * the page_cgroup and the task's ioprio and class retrieved from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = elv_io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		return elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -526,7 +560,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -609,7 +643,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -620,7 +654,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1244,14 +1278,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to the old queue unconditionally. Don't
+		 * worry about whether the new async prio queue has been
+		 * allocated or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why allocate a new queue now? Won't it be allocated
+		 * automatically when the next async request from the same
+		 * context arrives? Kept for now because the existing cfq
+		 * code allocates the new queue immediately upon prio change.
+		 */
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1302,7 +1350,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1340,7 +1388,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1350,12 +1398,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_group *iog = NULL;
 
 retry:
-	iog = elv_io_get_io_group(q, 1);
+	iog = elv_io_get_io_group_bio(q, bio, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached the async queue pointer as bio tracking
+		 * is enabled. Look into the group's async queue array using
+		 * the ioc class and prio to see if somebody has already
+		 * allocated the queue.
+		 */
+
+		cfqq = elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+
 	/*
 	 * Always try a new alloc if we fell back to the OOM cfqq
 	 * originally, since it should just be a temporary situation.
@@ -1432,14 +1496,14 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+		struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = elv_io_get_io_group(cfqd->queue, 1);
+	struct io_group *iog = elv_io_get_io_group_bio(cfqd->queue, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
@@ -1448,14 +1512,35 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq)
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 
 	if (!is_sync && !async_cfqq)
 		elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
 							cfqq->ioq);
-
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If the async request queue/group is determined from
+	 * the original task/cgroup and not from the submitting task, the io
+	 * context cannot cache the pointer to the async queue; every time a
+	 * request comes in, the queue is found by walking the async queue
+	 * array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a cgroup altogether different from the one the
+	 * iocontext belongs to, and this thread might be submitting bios
+	 * from various cgroups. The async queue therefore differs per
+	 * bio/rq cgroup, so we can't cache the async cfqq pointer in the cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are attributed to the task submitting
+	 * them, hence the cic can cache the async cfqq pointer. Take
+	 * the queue reference even for the async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1908,7 +1993,8 @@ static void cfq_put_request(struct request *rq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1928,7 +2014,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	}
 
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index e5bc823..cc9c8c3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -134,7 +134,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a14fa72..9c8783c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,7 @@
 #include <linux/blkdev.h>
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
+#include <linux/biotrack.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -917,6 +918,9 @@ struct io_cgroup io_root_cgroup = {
 
 static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -1376,9 +1380,45 @@ end:
 	return iog;
 }
 
+struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+						struct bio *bio, int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from the task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group.
+	 */
+	if (!bio)
+		goto sync;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to the root group. Maybe more special
+		 * bio cases should be handled here.
+		 */
+		return q->elevator->efqd->root_group;
+	}
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/* Map the sync bio to the right group using task context */
+	if (elv_bio_sync(bio))
+		goto sync;
+
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return elv_io_get_io_group(q, page, create);
+#endif
+
+sync:
+	return elv_io_get_io_group(q, page, create);
+}
+EXPORT_SYMBOL(elv_io_get_io_group_bio);
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group a page belongs to.
+ * If "create" is set, the io group is created if it is not already present.
  *
  * Note: This function should be called with queue lock held. It returns
  * a pointer to io group without taking any reference. That group will
@@ -1386,28 +1426,45 @@ end:
  * needs to get hold of queue lock). So if somebody needs to use group
  * pointer even after dropping queue lock, take a reference to the group
  * before dropping queue lock.
+ *
+ * One can call it without the queue lock, with the rcu read lock held,
+ * for browsing through the groups.
  */
-struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+struct io_group *
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = q->elevator->efqd;
 
-	assert_spin_locked(q->queue_lock);
+	if (create)
+		assert_spin_locked(q->queue_lock);
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
+
+	if (!page)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
+		iog = efqd->root_group;
+		goto out;
+	}
+
 	iog = io_find_alloc_group(q, cgroup, efqd, create);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
 	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
@@ -1637,7 +1694,7 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1670,7 +1727,7 @@ elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
  * function is not invoked.
  */
 int elv_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+				struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -1686,7 +1743,7 @@ int elv_set_request_ioq(struct request_queue *q, struct request *rq,
 
 retry:
 	/* Determine the io group request belongs to */
-	iog = elv_io_get_io_group(q, 1);
+	iog = elv_io_get_io_group_bio(q, bio, 1);
 	BUG_ON(!iog);
 
 	/* Get the iosched queue */
@@ -1773,18 +1830,20 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue a bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
 	/* Determine the io group and io queue of the bio submitting task */
-	iog = elv_io_get_io_group(q, 0);
+	iog = elv_io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
-		 * not been setup yet. */
+		/*
+	 * Maybe the bio belongs to a cgroup for which the io group
+	 * has not been set up yet.
+		 */
 		return NULL;
 	}
 	return iog->ioq;
@@ -2549,6 +2608,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 95ed680..9fe52fa 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -414,7 +414,9 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
 extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void elv_put_iog(struct io_group *iog);
 extern struct io_group *elv_io_get_io_group(struct request_queue *q,
-						int create);
+					struct page *page, int create);
+extern struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+					struct bio *bio, int create);
 extern ssize_t elv_group_idle_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_group_idle_store(struct elevator_queue *q, const char *name,
 					size_t count);
@@ -424,9 +426,10 @@ static inline void elv_get_iog(struct io_group *iog)
 }
 
 extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 #else /* !GROUP_IOSCHED */
 
@@ -439,14 +442,20 @@ static inline void elv_get_iog(struct io_group *iog) {}
 static inline void elv_put_iog(struct io_group *iog) {}
 
 static inline struct io_group *
-elv_io_get_io_group(struct request_queue *q, int create)
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
 {
 	/* In flat mode, there is only root group */
 	return q->elevator->efqd->root_group;
 }
 
-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline struct io_group *
+elv_io_get_io_group_bio(struct request_queue *q, struct bio *bio, int create)
+{
+	return q->elevator->efqd->root_group;
+}
+
+static inline int elv_set_request_ioq(struct request_queue *q,
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -454,7 +463,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 static inline void
 elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *
+elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	return NULL;
 }
@@ -553,8 +563,8 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline int elv_set_request_ioq(struct request_queue *q,
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -562,7 +572,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 static inline void
 elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index bc43edd..4ed37b6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -865,7 +865,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -874,10 +875,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_set_request_ioq(q, rq, gfp_mask);
+		return elv_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -1279,19 +1280,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group a bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
+	return elv_ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3d4e31c..0ace96e 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,7 +22,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -146,7 +147,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -275,6 +277,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, such as during
+ * merging and during request allocation, where we don't have an rq but only a
+ * bio and need to find out if this bio will be considered sync or async by
+ * the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread


* [PATCH 20/23] io-controller: Per cgroup request descriptor support
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (18 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 19/23] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 21/23] io-controller: Per io group bdi congestion interface Vivek Goyal
                     ` (10 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable once request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is a lot of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This patch implements per cgroup request descriptors. The request pool per
  queue is still common, but every group has its own wait list and its own
  count of request descriptors allocated to that group for sync and async
  queues. So effectively the request_list becomes a per io group property
  and not a global request queue feature.

o Currently one can set q->nr_requests to limit the request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors
  on the queue. A simplified sketch of this accounting is shown below.
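
To make the interplay of the two limits concrete, below is a minimal
userspace sketch of the accounting. The names group_rl, queue_counts and
may_allocate are illustrative only; the actual patch works on struct
request_list, q->rq_data and the tunables shown in the diff that follows.
Every group keeps its own descriptor counts while the queue keeps a global
total, and the queue-wide nr_requests limit always wins over
nr_group_requests.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for the per-group request list. */
struct group_rl {
	int count[2];		/* descriptors held by this group: 0=async, 1=sync */
};

/* Illustrative stand-in for the queue-wide counters and tunables. */
struct queue_counts {
	int total[2];		/* descriptors held queue-wide */
	int nr_requests;	/* global per-queue limit */
	int nr_group_requests;	/* per-group limit */
};

/*
 * Decide whether a new descriptor may be allocated for this group. The
 * queue-wide cap is checked first, so nr_requests supersedes
 * nr_group_requests. Both checks allow the same 50% batching slack that
 * get_request() uses below.
 */
static bool may_allocate(struct queue_counts *q, struct group_rl *rl, int sync)
{
	if (q->total[sync] >= (3 * q->nr_requests / 2))
		return false;

	if (rl->count[sync] >= (3 * q->nr_group_requests / 2))
		return false;

	return true;
}

int main(void)
{
	struct queue_counts q = { .nr_requests = 128, .nr_group_requests = 32 };
	struct group_rl busy = { .count = { 0, 48 } };	/* at its group cap */
	struct group_rl idle = { .count = { 0, 0 } };

	printf("busy group may allocate: %d\n", may_allocate(&q, &busy, 1));
	printf("idle group may allocate: %d\n", may_allocate(&q, &idle, 1));
	return 0;
}

With these numbers the busy group is denied a new sync descriptor while the
idle group on the same queue can still allocate one, which is the isolation
the per group request list is meant to provide.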

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c             |  317 +++++++++++++++++++++++++++++++++---------
 block/blk-settings.c         |    1 +
 block/blk-sysfs.c            |   59 ++++++--
 block/elevator-fq.c          |   36 +++++
 block/elevator-fq.h          |   29 ++++
 block/elevator.c             |    7 +-
 include/linux/blkdev.h       |   47 ++++++-
 include/trace/events/block.h |    6 +-
 kernel/trace/blktrace.c      |    6 +-
 9 files changed, 421 insertions(+), 87 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 47cce59..18b400b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+struct request_list *
+blk_get_request_list(struct request_queue *q, struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	/*
+	 * Determine which request list the bio will be allocated from.
+	 * This depends on which io group the bio belongs to.
+	 */
+	return elv_get_request_list_bio(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	return elv_get_request_list_rq(q, rq, priv);
+#else
+	return &q->rq;
+#endif
+}
+
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+	/*
+	 * In case of group scheduling, the request list is inside the group
+	 * and is initialized when the group is instantiated.
+	 */
+#ifndef CONFIG_GROUP_IOSCHED
+	blk_init_request_list(&q->rq);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	q->queue_flags		= QUEUE_FLAG_DEFAULT;
 	q->queue_lock		= lock;
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * This also sets hw/phys segments, boundary and size
 	 */
@@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
+	/*
+	 * There is a window during request allocation where a request is
+	 * mapped to one group but, by the time a queue for the group is
+	 * allocated, the original cgroup/io group may have been deleted
+	 * and the io queue is then allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is for the rq to take an io group
+	 * reference, but that looks like too much to solve this issue.
+	 * The only side effect of this hard-to-hit issue seems to be that
+	 * we will try to decrement rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going below
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (priv && rl->count[sync] > 0)
+		rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
 
-	rl->count[sync]--;
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where the request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on the global list, as no wakeup
+ * will take place on this request list.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+	if (unlikely(rl->count[is_sync] == 0)) {
+		/*
+		 * If there is a request pending in the other direction
+		 * in the same io group, then set the starved flag of
+		 * the group request list. Otherwise, we need to make
+		 * this process sleep on the global starved list to
+		 * make sure it will not sleep indefinitely.
+		 */
+		if (rl->count[is_sync ^ 1] != 0) {
+			rl->starved[is_sync] = 1;
+			return 1;
+		} else
+			return 0;
+	}
+
+	return 1;
 }
 
 /*
  * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps the caller decide whether to
+ * sleep on the per group list or the global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask,
+					struct request_list *rl, int *reason)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
+	int sleep_on_global = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/* queue full seems redundant now */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The group's request descriptor list will fill up after
+		 * this allocation, so mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+		/*
+		 * Queue is too full for allocation. On which request list
+		 * should the task sleep? Generally it should sleep on its
+		 * request list but if elevator switch is happening, in that
+		 * window, request descriptors are allocated from global
+		 * pool and are not accounted against any particular request
+		 * list as group is going away.
+		 *
+		 * So it might happen that request list does not have any
+		 * requests allocated at all and if process sleeps on per
+		 * group request list, it will not be woken up. In such case,
+		 * make it sleep on global starved list.
+		 */
+		if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+		    || !can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
+		goto out;
+	}
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
-	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (priv) {
+		q->rq_data.elvpriv++;
+		/*
+		 * Account the request to request list only if request is
+		 * going to elevator. During elevator switch, there will
+		 * be small window where group is going away and new group
+		 * will not be allocated till elevator switch is complete.
+		 * So till then instead of slowing down the application,
+		 * we will continue to allocate request from total common
+		 * pool instead of per group limit
+		 */
+		rl->count[is_sync]++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
+		if (!can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
 		goto out;
 	}
 
@@ -810,6 +951,8 @@ rq_starved:
 
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
+	if (reason && sleep_on_global)
+		*reason = 1;
 	return rq;
 }
 
@@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	int sleep_on_global = 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (sleep_on_global) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			/*
+			 * We are about to sleep on a request list and we
+			 * drop queue lock. After waking up, we will do
+			 * finish_wait() on request list and in the mean
+			 * time group might be gone. Take a reference to
+			 * the group now.
+			 */
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+			elv_get_rl_iog(rl);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		ioc_set_batching(q, ioc);
 
 		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		if (sleep_on_global) {
+			finish_wait(&q->rq_data.starved_wait, &wait);
+			sleep_on_global = 0;
+		} else {
+			/*
+			 * We had taken a reference to the rl/iog. Put that now
+			 */
+			finish_wait(&rl->wait[is_sync], &wait);
+			elv_put_rl_iog(rl);
+		}
+
+		/*
+		 * After the sleep, check the rl again in case the cgroup
+		 * the bio belonged to is gone and the bio is now mapped
+		 * to the root group
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+					&sleep_on_global);
 	};
 
 	return rq;
@@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 476d870..c3102c7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 418d636..f3db7f0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9c8783c..39896c2 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -925,6 +925,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
+{
+	struct io_group *iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		iog = q->elevator->efqd->root_group;
+	else
+		iog = elv_io_get_io_group_bio(q, bio, 1);
+
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
+{
+	struct io_group *iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->elevator->efqd->root_group->rl;
+
+	BUG_ON(priv && !rq->ioq);
+
+	if (priv)
+		iog = ioq_to_io_group(rq->ioq);
+	else
+		iog = q->elevator->efqd->root_group;
+
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1281,6 +1314,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		elv_get_iog(iog);
 		io_group_path(iog);
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1502,6 +1537,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 9fe52fa..989102e 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -128,6 +128,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 struct io_cgroup {
@@ -425,11 +428,31 @@ static inline void elv_get_iog(struct io_group *iog)
 	atomic_inc(&iog->ref);
 }
 
+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+	return container_of(rl, struct io_group, rl);
+}
+
+static inline void elv_get_rl_iog(struct request_list *rl)
+{
+	elv_get_iog(rl_iog(rl));
+}
+
+static inline void elv_put_rl_iog(struct request_list *rl)
+{
+	elv_put_iog(rl_iog(rl));
+}
+
 extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
 					struct bio *bio, gfp_t gfp_mask);
 extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
 
 #else /* !GROUP_IOSCHED */
 
@@ -469,6 +492,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 	return NULL;
 }
 
+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -578,6 +604,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 	return NULL;
 }
 
+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 4ed37b6..b23db03 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] -
+				queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7cff5f2..74deb17 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	512	/* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group being present (the root
+ * group). Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -339,10 +369,17 @@ struct request_queue
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
 
+#ifndef CONFIG_GROUP_IOSCHED
 	/*
 	 * the queue request freelist, one for reads and one for writes
+	 * In case of group io scheduling, this request list is per group
+	 * and is present in group data structure.
 	 */
 	struct request_list	rq;
+#endif
+
+	/* Contains the request mempool and other data like the starved counter */
+	struct request_data	rq_data;
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
@@ -405,6 +442,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+extern struct request_list *blk_get_request_list(struct request_queue *q,
+							struct bio *bio);
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 9a74b46..af6c9e5 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
 	),
 
 	TP_fast_assign(
-		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		__entry->nr_rq	= q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
@@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
 	),
 
 	TP_fast_assign(
-		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		__entry->nr_rq	= q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7a34cb5..9a03980 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt) {
-		unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+		unsigned int pdu = q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		__be64 rpdu = cpu_to_be64(pdu);
 
 		__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
@@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt) {
-		unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+		unsigned int pdu = q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		__be64 rpdu = cpu_to_be64(pdu);
 
 		__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 20/23] io-controller: Per cgroup request descriptor support
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o Currently a request queue has got fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones
  but if one is looking for fairness between async requests, that is not
  achievable if request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other group.

o This patch implements the per cgroup request descriptors. The request pool per
  queue is still common but every group will have its own wait list and its
  own count of request descriptors allocated to that group for sync and async
  queues. So effectively request_list becomes per io group property and not a
  global request queue feature.

o Currently one can define q->nr_requests to limit request descriptors
  allocated for the queue. Now there is another tunable q->nr_group_requests
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue. (Illustrative sketches of the resulting two-level check and of
  the new sysfs tunable follow below.)
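
o To illustrate how the two tunables interact, here is a stand-alone
  userspace sketch (not kernel code) of the two-level check that
  get_request() performs with this patch applied. The constants mirror the
  new defaults and the two-group loop is made up for illustration; the real
  code additionally applies congestion thresholds and the 3/2 hard limits.

#include <stdio.h>

#define NR_REQUESTS		512	/* q->nr_requests (per-queue limit) */
#define NR_GROUP_REQUESTS	128	/* q->nr_group_requests (per-group limit) */

struct group {
	const char *name;
	int count;			/* analogue of rl->count[] */
};

/* Return 1 if an allocation for @g would be admitted, 0 otherwise. */
static int may_allocate(const struct group *g, int queue_count)
{
	/* The queue-wide limit is checked first: q->nr_requests supersedes
	 * the per-group limit, so many groups cannot overrun the queue. */
	if (queue_count >= NR_REQUESTS)
		return 0;
	/* Then the per-group limit is enforced. */
	if (g->count >= NR_GROUP_REQUESTS)
		return 0;
	return 1;
}

int main(void)
{
	struct group a = { "cgroup-A", 0 }, b = { "cgroup-B", 0 };
	int queue_count = 0, i;

	/* cgroup-A asks for 200 request descriptors but is capped at 128 */
	for (i = 0; i < 200; i++) {
		if (may_allocate(&a, queue_count)) {
			a.count++;
			queue_count++;
		}
	}

	/* cgroup-B still gets its own 128 despite the pressure from A */
	for (i = 0; i < 200; i++) {
		if (may_allocate(&b, queue_count)) {
			b.count++;
			queue_count++;
		}
	}

	printf("%s got %d, %s got %d, queue total %d\n",
	       a.name, a.count, b.name, b.count, queue_count);
	return 0;
}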

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c             |  317 +++++++++++++++++++++++++++++++++---------
 block/blk-settings.c         |    1 +
 block/blk-sysfs.c            |   59 ++++++--
 block/elevator-fq.c          |   36 +++++
 block/elevator-fq.h          |   29 ++++
 block/elevator.c             |    7 +-
 include/linux/blkdev.h       |   47 ++++++-
 include/trace/events/block.h |    6 +-
 kernel/trace/blktrace.c      |    6 +-
 9 files changed, 421 insertions(+), 87 deletions(-)
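
Not part of the diff: a small userspace sketch showing how the new per-group
tunable exported by blk-sysfs.c could be inspected and adjusted alongside the
existing nr_requests knob. The device name "sda" is an assumption (substitute
your own disk), the nr_group_requests file only exists with
CONFIG_GROUP_IOSCHED enabled, and writes need root.

#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s = %s", path, buf);
	else
		perror(path);
	if (f)
		fclose(f);
}

int main(void)
{
	FILE *f;

	/* Existing per-queue limit */
	show("/sys/block/sda/queue/nr_requests");
	/* Per-group limit added by this patch */
	show("/sys/block/sda/queue/nr_group_requests");

	/* Lower the per-group limit to 64; queue_group_requests_store()
	 * clamps anything below BLKDEV_MIN_RQ. */
	f = fopen("/sys/block/sda/queue/nr_group_requests", "w");
	if (f) {
		fprintf(f, "64\n");
		fclose(f);
	} else {
		perror("nr_group_requests");
	}
	return 0;
}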

diff --git a/block/blk-core.c b/block/blk-core.c
index 47cce59..18b400b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+struct request_list *
+blk_get_request_list(struct request_queue *q, struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	/*
+	 * Determine which request list bio will be allocated from. This
+	 * is dependent on which io group bio belongs to
+	 */
+	return elv_get_request_list_bio(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	return elv_get_request_list_rq(q, rq, priv);
+#else
+	return &q->rq;
+#endif
+}
+
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+	/*
+	 * In case of group scheduling, request list is inside group and is
+	 * initialized when the group is instantiated.
+	 */
+#ifndef CONFIG_GROUP_IOSCHED
+	blk_init_request_list(&q->rq);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	q->queue_flags		= QUEUE_FLAG_DEFAULT;
 	q->queue_lock		= lock;
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * This also sets hw/phys segments, boundary and size
 	 */
@@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
+	/*
+	 * There is a window during request allocation where request is
+	 * mapped to one group but by the time a queue for the group is
+	 * allocated, it is possible that original cgroup/io group has been
+	 * deleted and now io queue is allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is that rq should take an io group
+	 * reference, but that looks like too much to solve this issue.
+	 * The only side effect of this hard-to-hit issue seems to be that
+	 * we will try to decrement the rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going less than
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (priv && rl->count[sync] > 0)
+		rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
 
-	rl->count[sync]--;
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where the request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on the global list, as no wakeup
+ * will take place on this request list.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+	if (unlikely(rl->count[is_sync] == 0)) {
+		/*
+		 * If there is a request pending in other direction
+		 * in same io group, then set the starved flag of
+		 * the group request list. Otherwise, we need to
+		 * make this process sleep in global starved list
+		 * to make sure it will not sleep indefinitely.
+		 */
+		if (rl->count[is_sync ^ 1] != 0) {
+			rl->starved[is_sync] = 1;
+			return 1;
+		} else
+			return 0;
+	}
+
+	return 1;
 }
 
 /*
  * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps the caller decide whether to
+ * sleep on the per group list or the global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask,
+					struct request_list *rl, int *reason)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
+	int sleep_on_global = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/* queue full seems redundant now */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so set it as full, and mark this process as
+		 * "batching". This process will be allowed to complete a
+		 * batch of requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+		/*
+		 * Queue is too full for allocation. On which list should
+		 * the task sleep? Generally it should sleep on its group's
+		 * request list, but if an elevator switch is happening, in
+		 * that window request descriptors are allocated from the
+		 * global pool and are not accounted against any particular
+		 * request list as the group is going away.
+		 *
+		 * So it might happen that the request list does not have any
+		 * requests allocated at all, and if the process sleeps on the
+		 * per group request list it will never be woken up. In that
+		 * case, make it sleep on the global starved list.
+		 */
+		if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+		    || !can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
+		goto out;
+	}
+
+	/*
+	 * Allocation of a request is allowed from the queue perspective. Now
+	 * check against the per group request list.
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
-	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (priv) {
+		q->rq_data.elvpriv++;
+		/*
+		 * Account the request to the request list only if the request
+		 * is going to the elevator. During an elevator switch there
+		 * will be a small window where the group is going away and a
+		 * new group will not be allocated till the switch completes.
+		 * So till then, instead of slowing down the application, we
+		 * will continue to allocate requests from the common pool
+		 * instead of against the per group limit.
+		 */
+		rl->count[is_sync]++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
+		if (!can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
 		goto out;
 	}
 
@@ -810,6 +951,8 @@ rq_starved:
 
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
+	if (reason && sleep_on_global)
+		*reason = 1;
 	return rq;
 }
 
@@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	int sleep_on_global = 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (sleep_on_global) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group, hence it needs to sleep on the global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			/*
+			 * We are about to sleep on a request list and we
+			 * drop the queue lock. After waking up, we will do
+			 * finish_wait() on the request list, and in the
+			 * meantime the group might be gone. Take a reference to
+			 * the group now.
+			 */
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+			elv_get_rl_iog(rl);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		ioc_set_batching(q, ioc);
 
 		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		if (sleep_on_global) {
+			finish_wait(&q->rq_data.starved_wait, &wait);
+			sleep_on_global = 0;
+		} else {
+			/*
+			 * We had taken a reference to the rl/iog. Put that now
+			 */
+			finish_wait(&rl->wait[is_sync], &wait);
+			elv_put_rl_iog(rl);
+		}
+
+		/*
+		 * After the sleep, check the rl again in case the cgroup the
+		 * bio belonged to is gone and it is now mapped to the root group
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+					&sleep_on_global);
 	};
 
 	return rq;
@@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 476d870..c3102c7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 418d636..f3db7f0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9c8783c..39896c2 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -925,6 +925,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
+{
+	struct io_group *iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		iog = q->elevator->efqd->root_group;
+	else
+		iog = elv_io_get_io_group_bio(q, bio, 1);
+
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
+{
+	struct io_group *iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->elevator->efqd->root_group->rl;
+
+	BUG_ON(priv && !rq->ioq);
+
+	if (priv)
+		iog = ioq_to_io_group(rq->ioq);
+	else
+		iog = q->elevator->efqd->root_group;
+
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1281,6 +1314,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		elv_get_iog(iog);
 		io_group_path(iog);
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1502,6 +1537,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 9fe52fa..989102e 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -128,6 +128,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 struct io_cgroup {
@@ -425,11 +428,31 @@ static inline void elv_get_iog(struct io_group *iog)
 	atomic_inc(&iog->ref);
 }
 
+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+	return container_of(rl, struct io_group, rl);
+}
+
+static inline void elv_get_rl_iog(struct request_list *rl)
+{
+	elv_get_iog(rl_iog(rl));
+}
+
+static inline void elv_put_rl_iog(struct request_list *rl)
+{
+	elv_put_iog(rl_iog(rl));
+}
+
 extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
 					struct bio *bio, gfp_t gfp_mask);
 extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
 
 #else /* !GROUP_IOSCHED */
 
@@ -469,6 +492,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 	return NULL;
 }
 
+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -578,6 +604,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 	return NULL;
 }
 
+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
 #endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 4ed37b6..b23db03 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] -
+				queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7cff5f2..74deb17 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	512	/* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -339,10 +369,17 @@ struct request_queue
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
 
+#ifndef CONFIG_GROUP_IOSCHED
 	/*
 	 * the queue request freelist, one for reads and one for writes
+	 * In case of group io scheduling, this request list is per group
+	 * and is present in the group data structure.
 	 */
 	struct request_list	rq;
+#endif
+
+	/* Contains the request mempool and other data like starved accounting */
+	struct request_data	rq_data;
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
@@ -405,6 +442,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+extern struct request_list *blk_get_request_list(struct request_queue *q,
+							struct bio *bio);
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 9a74b46..af6c9e5 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
 	),
 
 	TP_fast_assign(
-		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		__entry->nr_rq	= q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
@@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
 	),
 
 	TP_fast_assign(
-		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		__entry->nr_rq	= q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7a34cb5..9a03980 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt) {
-		unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+		unsigned int pdu = q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		__be64 rpdu = cpu_to_be64(pdu);
 
 		__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
@@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
 	struct blk_trace *bt = q->blk_trace;
 
 	if (bt) {
-		unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+		unsigned int pdu = q->rq_data.count[READ] +
+					q->rq_data.count[WRITE];
 		__be64 rpdu = cpu_to_be64(pdu);
 
 		__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 21/23] io-controller: Per io group bdi congestion interface
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 20/23] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 22/23] io-controller: Support per cgroup per device weights and io class Vivek Goyal
                     ` (9 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o So far there used to be only one pair of request descriptor queues (one for
  sync and one for async) per device, and the number of requests allocated
  was used to decide whether the associated bdi is congested or not.

  Now with the per io group request descriptor infrastructure, there is a pair
  of request descriptor queues per io group per device. So it might happen
  that the overall request queue is not congested but the particular io group
  a bio belongs to is congested.

  Or it could be the other way around: the group is not congested but the
  overall queue is congested. This can happen if the user has not properly
  set the request descriptor limits for the queue and groups
  (q->nr_requests < nr_groups * q->nr_group_requests).

  Hence there is a need for a new interface which can query device congestion
  status per group. The group is determined by the "struct page" IO will be
  done for. If the page is NULL, the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take "struct page" as an additional argument. These functions call into the
  block layer and in turn the elevator to find out whether the io group the
  page will go into is congested or not; a brief usage sketch follows below.

o Currently I have introduced the core functions and migrated most of the
  users, but there might still be some left. This is an ongoing TODO item.
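
  As a rough usage sketch (modeled on the writeback callers converted later in
  this patch; wbc, bdi and pvec are assumed to be set up as in a typical
  writepages loop), a caller checks the congestion state of the io group the
  page maps to before submitting more IO:

	if (wbc->nonblocking &&
	    bdi_write_congested_group(bdi, pvec.pages[0])) {
		/* the io group this batch of pages maps to is congested */
		wbc->encountered_congestion = 1;
		pagevec_release(&pvec);
		break;
	}

  When no page is available, callers pass NULL and the group is derived from
  the current task context, as in the ext2 and xfs readahead conversions
  below.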

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c            |   26 ++++++++
 block/blk-sysfs.c           |    6 +-
 block/elevator-fq.c         |  135 +++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |   24 +++++++-
 drivers/md/dm-table.c       |   11 ++-
 drivers/md/dm.c             |    7 +-
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 ++-
 drivers/md/multipath.c      |    7 ++-
 drivers/md/raid0.c          |    6 +-
 drivers/md/raid1.c          |    9 ++-
 drivers/md/raid10.c         |    6 +-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 ++-
 fs/btrfs/disk-io.c          |    6 +-
 fs/btrfs/extent_io.c        |   12 ++++
 fs/btrfs/volumes.c          |    8 ++-
 fs/cifs/file.c              |   11 ++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   63 +++++++++++++++++++-
 include/linux/blkdev.h      |    5 ++
 mm/backing-dev.c            |   74 ++++++++++++++++++++++-
 mm/page-writeback.c         |   11 ++++
 mm/readahead.c              |    2 +-
 28 files changed, 430 insertions(+), 40 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b400b..112a629 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_page_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_page_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
@@ -721,6 +742,8 @@ static void __freed_request(struct request_queue *q, int sync,
 	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
 		blk_clear_queue_full(q, sync);
 
+	elv_freed_request(rl, sync);
+
 	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
@@ -830,6 +853,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, is_sync);
 
+	/* check if the io group will get congested after this allocation */
+	elv_get_request(rl, is_sync);
+
 	/* queue full seems redundant now */
 	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
 		blk_set_queue_full(q, is_sync);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3db7f0..e0af5d6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -83,9 +83,8 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
 	return queue_var_show(q->nr_group_requests, (page));
 }
 
-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
-					size_t count)
+static ssize_t queue_group_requests_store(struct request_queue *q,
+					const char *page, size_t count)
 {
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
@@ -95,6 +94,7 @@ queue_group_requests_store(struct request_queue *q, const char *page,
 
 	spin_lock_irq(q->queue_lock);
 	q->nr_group_requests = nr;
+	elv_updated_nr_group_requests(q);
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39896c2..b43ac2f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -958,6 +958,139 @@ elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+void elv_clear_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	clear_bit(bit, &iog->state);
+	smp_mb__after_clear_bit();
+	congestion_wake_up(sync);
+}
+
+void elv_set_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	set_bit(bit, &iog->state);
+}
+
+static inline int elv_iog_congested(struct io_group *iog, int iog_bits)
+{
+	return iog->state & iog_bits;
+}
+
+/* Determine if the io group the page maps to is congested or not */
+int elv_page_io_group_congested(struct request_queue *q, struct page *page,
+								int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = elv_io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either the cgroup got deleted or this is the first request
+		 * in the group and the associated io group object has not been
+		 * created yet. Map it to the root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd->root_group;
+	}
+
+	if (sync)
+		ret = elv_iog_congested(iog, 1 << IOG_sync_congested);
+	else
+		ret = elv_iog_congested(iog, 1 << IOG_async_congested);
+
+	if (ret)
+		elv_log_iog(q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline int
+elv_iog_congestion_on_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_on;
+}
+
+static inline int
+elv_iog_congestion_off_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_off;
+}
+
+void elv_freed_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync] < elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, sync);
+}
+
+void elv_get_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync]+1 >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, sync);
+}
+
+static void iog_nr_requests_updated(struct io_group *iog)
+{
+	if (iog->rl.count[BLK_RW_SYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_SYNC);
+	else if (iog->rl.count[BLK_RW_SYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_SYNC);
+
+	if (iog->rl.count[BLK_RW_ASYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_ASYNC);
+	else if (iog->rl.count[BLK_RW_ASYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_ASYNC);
+}
+
+void elv_updated_nr_group_requests(struct request_queue *q)
+{
+	struct elv_fq_data *efqd;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	efqd = q->elevator->efqd;
+
+	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
+		elv_io_group_congestion_threshold(q, iog);
+		iog_nr_requests_updated(iog);
+	}
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1315,6 +1448,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		io_group_path(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1538,6 +1672,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 989102e..26c4857 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -95,6 +95,13 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+
+enum io_group_state {
+	IOG_async_congested,    /* The async queue of group is getting full */
+	IOG_sync_congested,     /* The sync queue of group is getting full */
+	IOG_unused,             /* Available bits start here */
+};
+
 struct io_group {
 	struct io_entity entity;
 	atomic_t ref;
@@ -129,6 +136,11 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
+	unsigned long state;
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -453,6 +465,11 @@ elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
 
 struct request_list *
 elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
+extern int elv_page_io_group_congested(struct request_queue *q,
+					struct page *page, int sync);
+extern void elv_freed_request(struct request_list *rl, int sync);
+extern void elv_get_request(struct request_list *rl, int sync);
+extern void elv_updated_nr_group_requests(struct request_queue *q);
 
 #else /* !GROUP_IOSCHED */
 
@@ -491,9 +508,11 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	return NULL;
 }
-
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* GROUP_IOSCHED */
 
@@ -606,6 +625,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d952b34..224d5a8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1170,7 +1170,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1180,9 +1181,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8a311ea..00a7d94 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a7663eb..bf533a9 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5fe39c2..10765da 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 7140909..52a54c7 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 898e2bd..915a95f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8726fd7..0f0c6ac 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3d9020c..d85351f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..b6cc455 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e83be2e..35cd95a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5dbefd1..ed2d100 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    num_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c34b7f8..33d0339 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1470,6 +1470,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 7ebae9a..f5fba6c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index aecf251..5835a2e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..473223a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..1b13539 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 enum {
 	BLK_RW_ASYNC	= 0,
 	BLK_RW_SYNC	= 1,
@@ -237,7 +294,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-
+extern void congestion_wake_up(int sync);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 74deb17..247e237 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -846,6 +846,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int sync)
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..60c91e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -283,16 +284,22 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
 
+void congestion_wake_up(int sync)
+{
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
 	enum bdi_state bit;
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
 	clear_bit(bit, &bdi->state);
 	smp_mb__after_clear_bit();
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
+	congestion_wake_up(sync);
 }
 EXPORT_SYMBOL(clear_bdi_congested);
 
@@ -327,3 +334,64 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue, so the generic notion of whether a queue is congested or not is
+ * not very accurate. The queue might not be congested, but the io group the
+ * request will go into might actually be congested.
+ *
+ * Hence, to get a correct idea of the congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page so that it
+ * can be used to determine the io group of the page, and the congestion
+ * status can be reported accordingly.
+ *
+ * If no page is passed, the io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1df421b..f924e05 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -985,6 +985,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 21/23] io-controller: Per io group bdi congestion interface
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o So far there used to be only one pair of request descriptor queues
  (one for sync and one for async) per device, and the number of requests
  allocated was used to decide whether the associated bdi is congested or not.

  Now, with the per io group request descriptor infrastructure, there is a
  pair of request descriptor queues per io group per device. So it might
  happen that the overall request queue is not congested but the particular
  io group a bio belongs to is congested.

  Or, the other way around: the group is not congested but the overall queue
  is. This can happen if the user has not set the request descriptor limits
  for the queue and groups consistently
  (q->nr_requests < nr_groups * q->nr_group_requests).

  Hence there is a need for a new interface which can query device congestion
  status per group. The group is determined by the "struct page" the IO will
  be done for. If the page is NULL, the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take a "struct page" as an additional argument. These functions call into
  the block layer, and in turn the elevator, to find out whether the io group
  the page will go into is congested.

o Currently I have introduced the core functions and migrated most of the
  users, but there might still be some left. This is an ongoing TODO item.
  A short usage sketch of the new interface follows below.
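
  As an illustration of the intended calling convention (not part of this
  patch), the sketch below shows how a non-blocking writeback path is
  expected to consult the new helpers. example_writeout() and
  submit_one_page() are hypothetical names used only for this sketch.

	#include <linux/fs.h>
	#include <linux/writeback.h>
	#include <linux/backing-dev.h>

	/* Minimal sketch, not part of this patch. */
	static int example_writeout(struct address_space *mapping,
				    struct writeback_control *wbc,
				    struct page *page)
	{
		struct backing_dev_info *bdi = mapping->backing_dev_info;

		/*
		 * A non-blocking writer backs off if either the whole bdi
		 * or the io group this page maps to has too many requests
		 * allocated.
		 */
		if (wbc->nonblocking &&
		    bdi_or_group_write_congested(bdi, page)) {
			wbc->encountered_congestion = 1;
			return 0;	/* back off and retry later */
		}

		return submit_one_page(mapping, wbc, page); /* hypothetical */
	}

  This mirrors the afs, xfs and page-writeback hunks in the patch. A group
  gets flagged congested in elv_get_request() once its allocated requests
  cross the nr_congestion_on threshold computed in
  elv_io_group_congestion_threshold() (e.g. for nr_group_requests = 128 that
  works out to roughly 113 on / 103 off), and the flag is cleared again from
  elv_freed_request().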

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c            |   26 ++++++++
 block/blk-sysfs.c           |    6 +-
 block/elevator-fq.c         |  135 +++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |   24 +++++++-
 drivers/md/dm-table.c       |   11 ++-
 drivers/md/dm.c             |    7 +-
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 ++-
 drivers/md/multipath.c      |    7 ++-
 drivers/md/raid0.c          |    6 +-
 drivers/md/raid1.c          |    9 ++-
 drivers/md/raid10.c         |    6 +-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 ++-
 fs/btrfs/disk-io.c          |    6 +-
 fs/btrfs/extent_io.c        |   12 ++++
 fs/btrfs/volumes.c          |    8 ++-
 fs/cifs/file.c              |   11 ++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   63 +++++++++++++++++++-
 include/linux/blkdev.h      |    5 ++
 mm/backing-dev.c            |   74 ++++++++++++++++++++++-
 mm/page-writeback.c         |   11 ++++
 mm/readahead.c              |    2 +-
 28 files changed, 430 insertions(+), 40 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b400b..112a629 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_page_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_page_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
@@ -721,6 +742,8 @@ static void __freed_request(struct request_queue *q, int sync,
 	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
 		blk_clear_queue_full(q, sync);
 
+	elv_freed_request(rl, sync);
+
 	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
@@ -830,6 +853,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, is_sync);
 
+	/* Check if the io group will get congested after this allocation */
+	elv_get_request(rl, is_sync);
+
 	/* queue full seems redundant now */
 	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
 		blk_set_queue_full(q, is_sync);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3db7f0..e0af5d6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -83,9 +83,8 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
 	return queue_var_show(q->nr_group_requests, (page));
 }
 
-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
-					size_t count)
+static ssize_t queue_group_requests_store(struct request_queue *q,
+					const char *page, size_t count)
 {
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
@@ -95,6 +94,7 @@ queue_group_requests_store(struct request_queue *q, const char *page,
 
 	spin_lock_irq(q->queue_lock);
 	q->nr_group_requests = nr;
+	elv_updated_nr_group_requests(q);
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39896c2..b43ac2f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -958,6 +958,139 @@ elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+void elv_clear_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	clear_bit(bit, &iog->state);
+	smp_mb__after_clear_bit();
+	congestion_wake_up(sync);
+}
+
+void elv_set_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	set_bit(bit, &iog->state);
+}
+
+static inline int elv_iog_congested(struct io_group *iog, int iog_bits)
+{
+	return iog->state & iog_bits;
+}
+
+/* Determine if the io group the page maps to is congested or not */
+int elv_page_io_group_congested(struct request_queue *q, struct page *page,
+								int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = elv_io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either cgroup got deleted or this is first request in the
+		 * group and associated io group object has not been created
+		 * yet. Map it to root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd->root_group;
+	}
+
+	if (sync)
+		ret = elv_iog_congested(iog, 1 << IOG_sync_congested);
+	else
+		ret = elv_iog_congested(iog, 1 << IOG_async_congested);
+
+	if (ret)
+		elv_log_iog(q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline int
+elv_iog_congestion_on_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_on;
+}
+
+static inline int
+elv_iog_congestion_off_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_off;
+}
+
+void elv_freed_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync] < elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, sync);
+}
+
+void elv_get_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync]+1 >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, sync);
+}
+
+static void iog_nr_requests_updated(struct io_group *iog)
+{
+	if (iog->rl.count[BLK_RW_SYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_SYNC);
+	else if (iog->rl.count[BLK_RW_SYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_SYNC);
+
+	if (iog->rl.count[BLK_RW_ASYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_ASYNC);
+	else if (iog->rl.count[BLK_RW_ASYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_ASYNC);
+}
+
+void elv_updated_nr_group_requests(struct request_queue *q)
+{
+	struct elv_fq_data *efqd;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	efqd = q->elevator->efqd;
+
+	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
+		elv_io_group_congestion_threshold(q, iog);
+		iog_nr_requests_updated(iog);
+	}
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1315,6 +1448,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		io_group_path(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1538,6 +1672,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 989102e..26c4857 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -95,6 +95,13 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+
+enum io_group_state {
+	IOG_async_congested,    /* The async queue of group is getting full */
+	IOG_sync_congested,     /* The sync queue of group is getting full */
+	IOG_unused,             /* Available bits start here */
+};
+
 struct io_group {
 	struct io_entity entity;
 	atomic_t ref;
@@ -129,6 +136,11 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
+	unsigned long state;
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -453,6 +465,11 @@ elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
 
 struct request_list *
 elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
+extern int elv_page_io_group_congested(struct request_queue *q,
+					struct page *page, int sync);
+extern void elv_freed_request(struct request_list *rl, int sync);
+extern void elv_get_request(struct request_list *rl, int sync);
+extern void elv_updated_nr_group_requests(struct request_queue *q);
 
 #else /* !GROUP_IOSCHED */
 
@@ -491,9 +508,11 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	return NULL;
 }
-
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* GROUP_IOSCHED */
 
@@ -606,6 +625,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d952b34..224d5a8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1170,7 +1170,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1180,9 +1181,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8a311ea..00a7d94 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a7663eb..bf533a9 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5fe39c2..10765da 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 7140909..52a54c7 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 898e2bd..915a95f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8726fd7..0f0c6ac 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3d9020c..d85351f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..b6cc455 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e83be2e..35cd95a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5dbefd1..ed2d100 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    num_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c34b7f8..33d0339 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1470,6 +1470,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 7ebae9a..f5fba6c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index aecf251..5835a2e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..473223a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..1b13539 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 enum {
 	BLK_RW_ASYNC	= 0,
 	BLK_RW_SYNC	= 1,
@@ -237,7 +294,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-
+extern void congestion_wake_up(int sync);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 74deb17..247e237 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -846,6 +846,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int sync)
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..60c91e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -283,16 +284,22 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
 
+void congestion_wake_up(int sync)
+{
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
 	enum bdi_state bit;
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
 	clear_bit(bit, &bdi->state);
 	smp_mb__after_clear_bit();
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
+	congestion_wake_up(sync);
 }
 EXPORT_SYMBOL(clear_bdi_congested);
 
@@ -327,3 +334,64 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue, so the generic notion of whether a queue is congested or not is
+ * not very accurate. The queue might not be congested, but the io group the
+ * request will go into might actually be congested.
+ *
+ * Hence, to get a correct idea of the congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page so that it
+ * can be used to determine the io group of the page, and the congestion
+ * status can be reported accordingly.
+ *
+ * If no page is passed, the io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1df421b..f924e05 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -985,6 +985,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 21/23] io-controller: Per io group bdi congestion interface
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o So far there used to be only one pair of request descriptor queues
  (one for sync and one for async) per device, and the number of requests
  allocated was used to decide whether the associated bdi is congested or not.

  Now, with the per io group request descriptor infrastructure, there is a
  pair of request descriptor queues per io group per device. So it might
  happen that the overall request queue is not congested but the particular
  io group a bio belongs to is congested.

  Or, the other way around: the group is not congested but the overall queue
  is. This can happen if the user has not set the request descriptor limits
  for the queue and groups consistently
  (q->nr_requests < nr_groups * q->nr_group_requests).

  Hence there is a need for a new interface which can query device congestion
  status per group. The group is determined by the "struct page" the IO will
  be done for. If the page is NULL, the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take a "struct page" as an additional argument. These functions call into
  the block layer, and in turn the elevator, to find out whether the io group
  the page will go into is congested.

o Currently I have introduced the core functions and migrated most of the
  users, but there might still be some left. This is an ongoing TODO item.
  A short usage sketch of the new interface follows below.
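
  As an illustration of the intended calling convention (not part of this
  patch), the sketch below shows how a non-blocking writeback path is
  expected to consult the new helpers. example_writeout() and
  submit_one_page() are hypothetical names used only for this sketch.

	#include <linux/fs.h>
	#include <linux/writeback.h>
	#include <linux/backing-dev.h>

	/* Minimal sketch, not part of this patch. */
	static int example_writeout(struct address_space *mapping,
				    struct writeback_control *wbc,
				    struct page *page)
	{
		struct backing_dev_info *bdi = mapping->backing_dev_info;

		/*
		 * A non-blocking writer backs off if either the whole bdi
		 * or the io group this page maps to has too many requests
		 * allocated.
		 */
		if (wbc->nonblocking &&
		    bdi_or_group_write_congested(bdi, page)) {
			wbc->encountered_congestion = 1;
			return 0;	/* back off and retry later */
		}

		return submit_one_page(mapping, wbc, page); /* hypothetical */
	}

  This mirrors the afs, xfs and page-writeback hunks in the patch. A group
  gets flagged congested in elv_get_request() once its allocated requests
  cross the nr_congestion_on threshold computed in
  elv_io_group_congestion_threshold() (e.g. for nr_group_requests = 128 that
  works out to roughly 113 on / 103 off), and the flag is cleared again from
  elv_freed_request().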

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c            |   26 ++++++++
 block/blk-sysfs.c           |    6 +-
 block/elevator-fq.c         |  135 +++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |   24 +++++++-
 drivers/md/dm-table.c       |   11 ++-
 drivers/md/dm.c             |    7 +-
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 ++-
 drivers/md/multipath.c      |    7 ++-
 drivers/md/raid0.c          |    6 +-
 drivers/md/raid1.c          |    9 ++-
 drivers/md/raid10.c         |    6 +-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 ++-
 fs/btrfs/disk-io.c          |    6 +-
 fs/btrfs/extent_io.c        |   12 ++++
 fs/btrfs/volumes.c          |    8 ++-
 fs/cifs/file.c              |   11 ++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   63 +++++++++++++++++++-
 include/linux/blkdev.h      |    5 ++
 mm/backing-dev.c            |   74 ++++++++++++++++++++++-
 mm/page-writeback.c         |   11 ++++
 mm/readahead.c              |    2 +-
 28 files changed, 430 insertions(+), 40 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b400b..112a629 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_page_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_page_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
@@ -721,6 +742,8 @@ static void __freed_request(struct request_queue *q, int sync,
 	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
 		blk_clear_queue_full(q, sync);
 
+	elv_freed_request(rl, sync);
+
 	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
@@ -830,6 +853,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, is_sync);
 
+	/* Check if the io group will get congested after this allocation */
+	elv_get_request(rl, is_sync);
+
 	/* queue full seems redundant now */
 	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
 		blk_set_queue_full(q, is_sync);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3db7f0..e0af5d6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -83,9 +83,8 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
 	return queue_var_show(q->nr_group_requests, (page));
 }
 
-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
-					size_t count)
+static ssize_t queue_group_requests_store(struct request_queue *q,
+					const char *page, size_t count)
 {
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
@@ -95,6 +94,7 @@ queue_group_requests_store(struct request_queue *q, const char *page,
 
 	spin_lock_irq(q->queue_lock);
 	q->nr_group_requests = nr;
+	elv_updated_nr_group_requests(q);
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39896c2..b43ac2f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -958,6 +958,139 @@ elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+void elv_clear_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	clear_bit(bit, &iog->state);
+	smp_mb__after_clear_bit();
+	congestion_wake_up(sync);
+}
+
+void elv_set_iog_congested(struct io_group *iog, int sync)
+{
+	enum io_group_state bit;
+
+	bit = sync ? IOG_sync_congested : IOG_async_congested;
+	set_bit(bit, &iog->state);
+}
+
+static inline int elv_iog_congested(struct io_group *iog, int iog_bits)
+{
+	return iog->state & iog_bits;
+}
+
+/* Determine if the io group the page maps to is congested or not */
+int elv_page_io_group_congested(struct request_queue *q, struct page *page,
+								int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = elv_io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either cgroup got deleted or this is first request in the
+		 * group and associated io group object has not been created
+		 * yet. Map it to root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd->root_group;
+	}
+
+	if (sync)
+		ret = elv_iog_congested(iog, 1 << IOG_sync_congested);
+	else
+		ret = elv_iog_congested(iog, 1 << IOG_async_congested);
+
+	if (ret)
+		elv_log_iog(q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline int
+elv_iog_congestion_on_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_on;
+}
+
+static inline int
+elv_iog_congestion_off_threshold(struct io_group *iog)
+{
+	return iog->nr_congestion_off;
+}
+
+void elv_freed_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync] < elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, sync);
+}
+
+void elv_get_request(struct request_list *rl, int sync)
+{
+	struct io_group *iog = rl_iog(rl);
+
+	if (iog->rl.count[sync]+1 >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, sync);
+}
+
+static void iog_nr_requests_updated(struct io_group *iog)
+{
+	if (iog->rl.count[BLK_RW_SYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_SYNC);
+	else if (iog->rl.count[BLK_RW_SYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_SYNC);
+
+	if (iog->rl.count[BLK_RW_ASYNC] >= elv_iog_congestion_on_threshold(iog))
+		elv_set_iog_congested(iog, BLK_RW_ASYNC);
+	else if (iog->rl.count[BLK_RW_ASYNC] <
+				elv_iog_congestion_off_threshold(iog))
+		elv_clear_iog_congested(iog, BLK_RW_ASYNC);
+}
+
+void elv_updated_nr_group_requests(struct request_queue *q)
+{
+	struct elv_fq_data *efqd;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	efqd = q->elevator->efqd;
+
+	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
+		elv_io_group_congestion_threshold(q, iog);
+		iog_nr_requests_updated(iog);
+	}
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1315,6 +1448,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		io_group_path(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1538,6 +1672,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 989102e..26c4857 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -95,6 +95,13 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+
+enum io_group_state {
+	IOG_async_congested,    /* The async queue of group is getting full */
+	IOG_sync_congested,     /* The sync queue of group is getting full */
+	IOG_unused,             /* Available bits start here */
+};
+
 struct io_group {
 	struct io_entity entity;
 	atomic_t ref;
@@ -129,6 +136,11 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
+	unsigned long state;
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -453,6 +465,11 @@ elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
 
 struct request_list *
 elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
+extern int elv_page_io_group_congested(struct request_queue *q,
+					struct page *page, int sync);
+extern void elv_freed_request(struct request_list *rl, int sync);
+extern void elv_get_request(struct request_list *rl, int sync);
+extern void elv_updated_nr_group_requests(struct request_queue *q);
 
 #else /* !GROUP_IOSCHED */
 
@@ -491,9 +508,11 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	return NULL;
 }
-
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* GROUP_IOSCHED */
 
@@ -606,6 +625,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 
 static inline void elv_get_rl_iog(struct request_list *rl) { }
 static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }
 
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d952b34..224d5a8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1170,7 +1170,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1180,9 +1181,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8a311ea..00a7d94 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a7663eb..bf533a9 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5fe39c2..10765da 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 7140909..52a54c7 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 898e2bd..915a95f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8726fd7..0f0c6ac 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3d9020c..d85351f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..b6cc455 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e83be2e..35cd95a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5dbefd1..ed2d100 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    num_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c34b7f8..33d0339 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1470,6 +1470,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 7ebae9a..f5fba6c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index aecf251..5835a2e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..473223a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..1b13539 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 enum {
 	BLK_RW_ASYNC	= 0,
 	BLK_RW_SYNC	= 1,
@@ -237,7 +294,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-
+extern void congestion_wake_up(int sync);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 74deb17..247e237 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -846,6 +846,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int sync)
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..60c91e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -283,16 +284,22 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
 
+void congestion_wake_up(int sync)
+{
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
 	enum bdi_state bit;
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
 	clear_bit(bit, &bdi->state);
 	smp_mb__after_clear_bit();
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
+	congestion_wake_up(sync);
 }
 EXPORT_SYMBOL(clear_bdi_congested);
 
@@ -327,3 +334,64 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue. So generic notion of whether queue is congested or not is not
+ * very accurate. The queue might not be congested, but the io group the
+ * request will go into might be.
+ *
+ * Hence, to get an accurate picture of the congestion level, one should
+ * query the io group congestion status on the queue. Pass in the page so
+ * that the io group the page belongs to can be determined and its
+ * congestion status reported.
+ *
+ * If page info is not passed, io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1df421b..f924e05 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -985,6 +985,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread
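
A minimal, hedged C sketch (not part of the patch) of the calling convention the per-group congestion interface above introduces: a writeback loop checks both the device-wide state and the io group the page maps to. my_writepages_loop() and its arguments are hypothetical; only bdi_or_group_write_congested() comes from the patch.

/*
 * Illustrative sketch only, not part of the patch: the calling pattern
 * the new per-group congestion interface enables in a writeback loop.
 * my_writepages_loop() and its arguments are hypothetical;
 * bdi_or_group_write_congested() is the helper added above.
 */
#include <linux/backing-dev.h>
#include <linux/writeback.h>
#include <linux/pagevec.h>

static int my_writepages_loop(struct backing_dev_info *bdi,
			      struct writeback_control *wbc,
			      struct pagevec *pvec)
{
	/* Check device-wide congestion and the io group of the first page. */
	if (wbc->nonblocking &&
	    bdi_or_group_write_congested(bdi, pvec->pages[0])) {
		wbc->encountered_congestion = 1;
		return 1;	/* back off, as the callers above do */
	}
	return 0;		/* not congested, keep writing */
}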

* [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 21/23] io-controller: Per io group bdi congestion interface Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-28 21:31   ` [PATCH 23/23] io-controller: debug elevator fair queuing support Vivek Goyal
                     ` (8 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.

You can use the following format to play with the new interface.
#echo dev_major:dev_minor weight ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Configure weight=500 ioprio_class=1 on /dev/hda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev	weight	class
8:0	500	1
8:16	300	2

Remove the policy for /dev/hda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev	weight	class
8:16	300	2
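
For illustration only, a hedged userspace sketch (not part of this patch) that writes one such policy line from C. The cgroup directory "/cgroup/test1" is an assumed example path.

/*
 * Hypothetical helper, not part of the patch: writes one
 * "major:minor weight ioprio_class" line to a cgroup's io.policy file.
 */
#include <stdio.h>

static int set_io_policy(const char *cgroup_dir, unsigned int major,
			 unsigned int minor, unsigned int weight,
			 unsigned int ioprio_class)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/io.policy", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* weight == 0 removes the per-device policy, as described above */
	fprintf(f, "%u:%u %u %u\n", major, minor, weight, ioprio_class);
	return fclose(f);
}

int main(void)
{
	/* weight=300, class=2 (BE) for 8:16, mirroring the example above */
	return set_io_policy("/cgroup/test1", 8, 16, 300, 2) ? 1 : 0;
}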

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave()/spin_unlock_irqrestore() variants to avoid
  enabling interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  263 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++
 2 files changed, 269 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b43ac2f..9e714d5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -15,6 +15,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
 #include <linux/biotrack.h>
+#include <linux/genhd.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -866,12 +867,26 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);
 #ifdef CONFIG_GROUP_IOSCHED
 static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
 
-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+							dev_t dev);
+static void
+io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog, dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = iocg->weight;
-	entity->ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sd = &iog->sched_data;
 }
@@ -1111,6 +1126,229 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev\tweight\tclass\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+			   MINOR(pn->dev), pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+	int part = 0;
+	struct gendisk *disk;
+
+	disk = get_gendisk(dev, &part);
+	if (!disk || part)
+		return -ENODEV;
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+	int ret;
+	unsigned long major, minor, temp;
+	int i = 0;
+	dev_t dev;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Prevent too many fields from being input */
+		if (i == 4)
+			break;
+	}
+
+	if (i != 3)
+		return -EINVAL;
+
+	p = strsep(&s[0], ":");
+	if (p != NULL)
+		major_s = p;
+	else
+		return -EINVAL;
+
+	minor_s = s[0];
+	if (!minor_s)
+		return -EINVAL;
+
+	ret = strict_strtoul(major_s, 10, &major);
+	if (ret)
+		return -EINVAL;
+
+	ret = strict_strtoul(minor_s, 10, &minor);
+	if (ret)
+		return -EINVAL;
+
+	dev = MKDEV(major, minor);
+
+	ret = check_dev_num(dev);
+	if (ret)
+		return ret;
+
+	newpn->dev = dev;
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &temp);
+	if (ret || temp > IO_WEIGHT_MAX)
+		return -EINVAL;
+
+	newpn->weight =  temp;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &temp);
+	if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+	newpn->ioprio_class = temp;
+
+	return 0;
+}
+
+static void update_iog_weight_prio(struct io_group *iog, struct io_cgroup *iocg,
+					struct io_policy_node *pn)
+{
+	if (pn->weight) {
+		iog->entity.weight = pn->weight;
+		iog->entity.ioprio_class = pn->ioprio_class;
+		/*
+		 * iog weight and ioprio_class updating actually happens if
+		 * ioprio_changed is set. So ensure ioprio_changed is not set
+		 * until new weight and new ioprio_class are updated.
+		 */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	} else {
+		iog->entity.weight = iocg->weight;
+		iog->entity.ioprio_class = iocg->ioprio_class;
+
+		/* The same as above */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	}
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting the policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev)
+			update_iog_weight_prio(iog, iocg, newpn);
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1143,6 +1381,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1155,6 +1394,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1290,6 +1532,12 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 
 struct cftype io_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1340,6 +1588,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_WEIGHT_DEFAULT;
 	iocg->ioprio_class = IOPRIO_CLASS_BE;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1433,7 +1682,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, iog->dev);
 
 		atomic_set(&iog->ref, 0);
 
@@ -1761,6 +2010,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct io_group *iog;
 	struct elv_fq_data *efqd;
 	unsigned long uninitialized_var(flags);
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1800,6 +2050,11 @@ remove_entry:
 	goto remove_entry;
 
 done:
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 26c4857..d462269 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,12 +145,22 @@ struct io_group {
 	struct request_list rl;
 };
 
+struct io_policy_node {
+	struct list_head node;
+	dev_t dev;
+	unsigned int weight;
+	unsigned short ioprio_class;
+};
+
 struct io_cgroup {
 	struct cgroup_subsys_state css;
 
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.

You can use the following format to play with the new interface.
#echo dev_major:dev_minor weight ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Configure weight=500 ioprio_class=1 on /dev/hda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev	weight	class
8:0	500	1
8:16	300	2

Remove the policy for /dev/hda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave()/spin_unlock_irqrestore() variants to avoid
  enabling interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  263 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++
 2 files changed, 269 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b43ac2f..9e714d5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -15,6 +15,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
 #include <linux/biotrack.h>
+#include <linux/genhd.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -866,12 +867,26 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);
 #ifdef CONFIG_GROUP_IOSCHED
 static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
 
-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+							dev_t dev);
+static void
+io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog, dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = iocg->weight;
-	entity->ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sd = &iog->sched_data;
 }
@@ -1111,6 +1126,229 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev\tweight\tclass\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+			   MINOR(pn->dev), pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+	int part = 0;
+	struct gendisk *disk;
+
+	disk = get_gendisk(dev, &part);
+	if (!disk || part)
+		return -ENODEV;
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+	int ret;
+	unsigned long major, minor, temp;
+	int i = 0;
+	dev_t dev;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Prevent too many fields from being input */
+		if (i == 4)
+			break;
+	}
+
+	if (i != 3)
+		return -EINVAL;
+
+	p = strsep(&s[0], ":");
+	if (p != NULL)
+		major_s = p;
+	else
+		return -EINVAL;
+
+	minor_s = s[0];
+	if (!minor_s)
+		return -EINVAL;
+
+	ret = strict_strtoul(major_s, 10, &major);
+	if (ret)
+		return -EINVAL;
+
+	ret = strict_strtoul(minor_s, 10, &minor);
+	if (ret)
+		return -EINVAL;
+
+	dev = MKDEV(major, minor);
+
+	ret = check_dev_num(dev);
+	if (ret)
+		return ret;
+
+	newpn->dev = dev;
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &temp);
+	if (ret || temp > IO_WEIGHT_MAX)
+		return -EINVAL;
+
+	newpn->weight =  temp;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &temp);
+	if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+	newpn->ioprio_class = temp;
+
+	return 0;
+}
+
+static void update_iog_weight_prio(struct io_group *iog, struct io_cgroup *iocg,
+					struct io_policy_node *pn)
+{
+	if (pn->weight) {
+		iog->entity.weight = pn->weight;
+		iog->entity.ioprio_class = pn->ioprio_class;
+		/*
+		 * iog weight and ioprio_class updating actually happens if
+		 * ioprio_changed is set. So ensure ioprio_changed is not set
+		 * until new weight and new ioprio_class are updated.
+		 */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	} else {
+		iog->entity.weight = iocg->weight;
+		iog->entity.ioprio_class = iocg->ioprio_class;
+
+		/* The same as above */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	}
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting the policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev)
+			update_iog_weight_prio(iog, iocg, newpn);
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1143,6 +1381,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1155,6 +1394,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1290,6 +1532,12 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 
 struct cftype io_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1340,6 +1588,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_WEIGHT_DEFAULT;
 	iocg->ioprio_class = IOPRIO_CLASS_BE;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1433,7 +1682,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, iog->dev);
 
 		atomic_set(&iog->ref, 0);
 
@@ -1761,6 +2010,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct io_group *iog;
 	struct elv_fq_data *efqd;
 	unsigned long uninitialized_var(flags);
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1800,6 +2050,11 @@ remove_entry:
 	goto remove_entry;
 
 done:
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 26c4857..d462269 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,12 +145,22 @@ struct io_group {
 	struct request_list rl;
 };
 
+struct io_policy_node {
+	struct list_head node;
+	dev_t dev;
+	unsigned int weight;
+	unsigned short ioprio_class;
+};
+
 struct io_cgroup {
 	struct cgroup_subsys_state css;
 
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.

You can use the following format to play with the new interface.
#echo dev_major:dev_minor weight ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Configure weight=500 ioprio_class=1 on /dev/hda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev	weight	class
8:0	500	1
8:16	300	2

Remove the policy for /dev/hda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave()/spin_unlock_irqrestore() variants to avoid
  enabling interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  263 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++
 2 files changed, 269 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b43ac2f..9e714d5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -15,6 +15,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
 #include <linux/biotrack.h>
+#include <linux/genhd.h>
 #include "elevator-fq.h"
 
 const int elv_slice_sync = HZ / 10;
@@ -866,12 +867,26 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);
 #ifdef CONFIG_GROUP_IOSCHED
 static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
 
-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+							dev_t dev);
+static void
+io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog, dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = iocg->weight;
-	entity->ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sd = &iog->sched_data;
 }
@@ -1111,6 +1126,229 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev\tweight\tclass\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+			   MINOR(pn->dev), pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+	int part = 0;
+	struct gendisk *disk;
+
+	disk = get_gendisk(dev, &part);
+	if (!disk || part)
+		return -ENODEV;
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+	int ret;
+	unsigned long major, minor, temp;
+	int i = 0;
+	dev_t dev;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Prevent too many fields from being input */
+		if (i == 4)
+			break;
+	}
+
+	if (i != 3)
+		return -EINVAL;
+
+	p = strsep(&s[0], ":");
+	if (p != NULL)
+		major_s = p;
+	else
+		return -EINVAL;
+
+	minor_s = s[0];
+	if (!minor_s)
+		return -EINVAL;
+
+	ret = strict_strtoul(major_s, 10, &major);
+	if (ret)
+		return -EINVAL;
+
+	ret = strict_strtoul(minor_s, 10, &minor);
+	if (ret)
+		return -EINVAL;
+
+	dev = MKDEV(major, minor);
+
+	ret = check_dev_num(dev);
+	if (ret)
+		return ret;
+
+	newpn->dev = dev;
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &temp);
+	if (ret || temp > IO_WEIGHT_MAX)
+		return -EINVAL;
+
+	newpn->weight =  temp;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &temp);
+	if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+	newpn->ioprio_class = temp;
+
+	return 0;
+}
+
+static void update_iog_weight_prio(struct io_group *iog, struct io_cgroup *iocg,
+					struct io_policy_node *pn)
+{
+	if (pn->weight) {
+		iog->entity.weight = pn->weight;
+		iog->entity.ioprio_class = pn->ioprio_class;
+		/*
+		 * iog weight and ioprio_class updating actually happens if
+		 * ioprio_changed is set. So ensure ioprio_changed is not set
+		 * until new weight and new ioprio_class are updated.
+		 */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	} else {
+		iog->entity.weight = iocg->weight;
+		iog->entity.ioprio_class = iocg->ioprio_class;
+
+		/* The same as above */
+		smp_wmb();
+		iog->entity.ioprio_changed = 1;
+	}
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting the policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev)
+			update_iog_weight_prio(iog, iocg, newpn);
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1143,6 +1381,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1155,6 +1394,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1290,6 +1532,12 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 
 struct cftype io_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1340,6 +1588,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_WEIGHT_DEFAULT;
 	iocg->ioprio_class = IOPRIO_CLASS_BE;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1433,7 +1682,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, iog->dev);
 
 		atomic_set(&iog->ref, 0);
 
@@ -1761,6 +2010,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct io_group *iog;
 	struct elv_fq_data *efqd;
 	unsigned long uninitialized_var(flags);
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1800,6 +2050,11 @@ remove_entry:
 	goto remove_entry;
 
 done:
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 26c4857..d462269 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,12 +145,22 @@ struct io_group {
 	struct request_list rl;
 };
 
+struct io_policy_node {
+	struct list_head node;
+	dev_t dev;
+	unsigned int weight;
+	unsigned short ioprio_class;
+};
+
 struct io_cgroup {
 	struct cgroup_subsys_state css;
 
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 23/23] io-controller: debug elevator fair queuing support
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (21 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 22/23] io-controller: Support per cgroup per device weights and io class Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  2009-08-31  1:09   ` [RFC] IO scheduler based IO controller V9 Gui Jianfeng
                     ` (7 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o More debugging support for elevator fair queuing. Enabled under
  CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime-related
  trace messages in blktrace.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    9 +++++++++
 block/elevator-fq.c   |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8b507c4..edcd317 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -15,6 +15,15 @@ config ELV_FAIR_QUEUING
 	  other ioschedulers can make use of it.
 	  If unsure, say N.
 
+config DEBUG_ELV_FAIR_QUEUING
+	bool "Debug elevator fair queuing"
+	depends on ELV_FAIR_QUEUING
+	default n
+	---help---
+	  Enable some debugging hooks for elevator fair queuing support.
+	  Currently it just outputs more information about vdisktime in
+	  blktrace output.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9e714d5..b723c12 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -34,6 +34,24 @@ static struct kmem_cache *elv_ioq_pool;
 #define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
 				{ RB_ROOT, NULL, 0, NULL, 0})
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+#define elv_log_entity(entity, fmt, args...) 			\
+{                                                               \
+{								\
+	struct io_queue *ioq = ioq_of(entity);			\
+	struct io_group *iog = iog_of(entity);			\
+								\
+	if (ioq) {						\
+		elv_log_ioq(ioq->efqd, ioq, fmt, ##args);	\
+	} else	{						\
+		elv_log_iog((struct elv_fq_data *)iog->key, iog, fmt, ##args);\
+	}							\
+}								\
+}
+#else
+#define elv_log_entity(entity, fmt, args...)
+#endif
+
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
 	if (entity->my_sd == NULL)
@@ -350,15 +368,39 @@ static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
 static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
 #endif /* DEBUG_GROUP_IOSCHED */
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta)
+{
+	struct elv_fq_data *efqd;
+	struct io_group *iog;
+
+	elv_log_entity(entity, "vdisktime=%llu service=%lu delta=%llu"
+				" entity->weight=%u", entity->vdisktime,
+				served, delta, entity->weight);
+
+	iog = iog_of(parent_entity(entity));
+	efqd = iog->key;
+	elv_log_iog(efqd, iog, "min_vdisktime=%llu", entity->st->min_vdisktime);
+}
+#else /* DEBUG_ELV_FAIR_QUEUING */
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta) {}
+#endif /* DEBUG_ELV_FAIR_QUEUING */
+
 static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
 	for_each_entity(entity) {
-		entity->vdisktime += elv_delta_fair(served, entity);
+		u64 delta;
+
+		delta = elv_delta_fair(served, entity);
+		entity->vdisktime += delta;
 		update_min_vdisktime(entity->st);
 		entity->total_time += served;
 		entity->total_sectors += nr_sectors;
+		debug_entity_vdisktime(entity, served, delta);
 	}
 }
 
@@ -391,6 +433,9 @@ static void place_entity(struct io_service_tree *st, struct io_entity *entity,
 		vdisktime = st->min_vdisktime;
 
 	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+	elv_log_entity(entity, "place_entity: vdisktime=%llu"
+			" min_vdisktime=%llu", entity->vdisktime,
+			st->min_vdisktime);
 }
 
 static inline void io_entity_update_prio(struct io_entity *entity)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread
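
A hedged, standalone illustration (not from the patch) of what the new trace lines report: assuming elv_delta_fair() scales service inversely with weight, as CFS-style virtual time usually does, a heavier-weighted entity accrues vdisktime more slowly for the same disk time, so the logged delta shrinks as the weight grows. The reference weight below is an assumption.

/*
 * Illustrative only -- not from the patch.  Assumes elv_delta_fair()
 * scales served time inversely with weight (CFS-style virtual time);
 * the real helper and the default weight live elsewhere in the series.
 */
#include <stdio.h>

#define ASSUMED_REF_WEIGHT 500	/* assumed reference weight */

static unsigned long long delta_fair(unsigned long served, unsigned int weight)
{
	return (unsigned long long)served * ASSUMED_REF_WEIGHT / weight;
}

int main(void)
{
	/* Three entities each receive 80 units of disk time. */
	printf("weight  250: vdisktime += %llu\n", delta_fair(80, 250));  /* 160 */
	printf("weight  500: vdisktime += %llu\n", delta_fair(80, 500));  /*  80 */
	printf("weight 1000: vdisktime += %llu\n", delta_fair(80, 1000)); /*  40 */
	return 0;
}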

* [PATCH 23/23] io-controller: debug elevator fair queuing support
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-28 21:31   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

o More debugging support for elevator fair queuing. Enabled under
  CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime-related
  trace messages in blktrace.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    9 +++++++++
 block/elevator-fq.c   |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8b507c4..edcd317 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -15,6 +15,15 @@ config ELV_FAIR_QUEUING
 	  other ioschedulers can make use of it.
 	  If unsure, say N.
 
+config DEBUG_ELV_FAIR_QUEUING
+	bool "Debug elevator fair queuing"
+	depends on ELV_FAIR_QUEUING
+	default n
+	---help---
+	  Enable some debugging hooks for elevator fair queuing support.
+	  Currently it just outputs more information about vdisktime in
+	  blktrace output.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9e714d5..b723c12 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -34,6 +34,24 @@ static struct kmem_cache *elv_ioq_pool;
 #define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
 				{ RB_ROOT, NULL, 0, NULL, 0})
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+#define elv_log_entity(entity, fmt, args...) 			\
+{                                                               \
+{								\
+	struct io_queue *ioq = ioq_of(entity);			\
+	struct io_group *iog = iog_of(entity);			\
+								\
+	if (ioq) {						\
+		elv_log_ioq(ioq->efqd, ioq, fmt, ##args);	\
+	} else	{						\
+		elv_log_iog((struct elv_fq_data *)iog->key, iog, fmt, ##args);\
+	}							\
+}								\
+}
+#else
+#define elv_log_entity(entity, fmt, args...)
+#endif
+
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
 	if (entity->my_sd == NULL)
@@ -350,15 +368,39 @@ static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
 static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
 #endif /* DEBUG_GROUP_IOSCHED */
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta)
+{
+	struct elv_fq_data *efqd;
+	struct io_group *iog;
+
+	elv_log_entity(entity, "vdisktime=%llu service=%lu delta=%llu"
+				" entity->weight=%u", entity->vdisktime,
+				served, delta, entity->weight);
+
+	iog = iog_of(parent_entity(entity));
+	efqd = iog->key;
+	elv_log_iog(efqd, iog, "min_vdisktime=%llu", entity->st->min_vdisktime);
+}
+#else /* DEBUG_ELV_FAIR_QUEUING */
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta) {}
+#endif /* DEBUG_ELV_FAIR_QUEUING */
+
 static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
 	for_each_entity(entity) {
-		entity->vdisktime += elv_delta_fair(served, entity);
+		u64 delta;
+
+		delta = elv_delta_fair(served, entity);
+		entity->vdisktime += delta;
 		update_min_vdisktime(entity->st);
 		entity->total_time += served;
 		entity->total_sectors += nr_sectors;
+		debug_entity_vdisktime(entity, served, delta);
 	}
 }
 
@@ -391,6 +433,9 @@ static void place_entity(struct io_service_tree *st, struct io_entity *entity,
 		vdisktime = st->min_vdisktime;
 
 	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+	elv_log_entity(entity, "place_entity: vdisktime=%llu"
+			" min_vdisktime=%llu", entity->vdisktime,
+			st->min_vdisktime);
 }
 
 static inline void io_entity_update_prio(struct io_entity *entity)
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH 23/23] io-controller: debug elevator fair queuing support
@ 2009-08-28 21:31   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:31 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	akpm, righi.andrea, torvalds

o More debugging help for elevator fair queuing support. Enabled under
  CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime-related
  trace messages in blktrace.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    9 +++++++++
 block/elevator-fq.c   |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8b507c4..edcd317 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -15,6 +15,15 @@ config ELV_FAIR_QUEUING
 	  other ioschedulers can make use of it.
 	  If unsure, say N.
 
+config DEBUG_ELV_FAIR_QUEUING
+	bool "Debug elevator fair queuing"
+	depends on ELV_FAIR_QUEUING
+	default n
+	---help---
+	  Enable some debugging hooks for elevator fair queuing support.
+	  Currently it just outputs more information about vdisktime in
+	  blktrace output.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9e714d5..b723c12 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -34,6 +34,24 @@ static struct kmem_cache *elv_ioq_pool;
 #define ELV_SERVICE_TREE_INIT   ((struct io_service_tree)	\
 				{ RB_ROOT, NULL, 0, NULL, 0})
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+#define elv_log_entity(entity, fmt, args...) 			\
+{                                                               \
+{								\
+	struct io_queue *ioq = ioq_of(entity);			\
+	struct io_group *iog = iog_of(entity);			\
+								\
+	if (ioq) {						\
+		elv_log_ioq(ioq->efqd, ioq, fmt, ##args);	\
+	} else	{						\
+		elv_log_iog((struct elv_fq_data *)iog->key, iog, fmt, ##args);\
+	}							\
+}								\
+}
+#else
+#define elv_log_entity(entity, fmt, args...)
+#endif
+
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
 	if (entity->my_sd == NULL)
@@ -350,15 +368,39 @@ static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
 static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
 #endif /* DEBUG_GROUP_IOSCHED */
 
+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta)
+{
+	struct elv_fq_data *efqd;
+	struct io_group *iog;
+
+	elv_log_entity(entity, "vdisktime=%llu service=%lu delta=%llu"
+				" entity->weight=%u", entity->vdisktime,
+				served, delta, entity->weight);
+
+	iog = iog_of(parent_entity(entity));
+	efqd = iog->key;
+	elv_log_iog(efqd, iog, "min_vdisktime=%llu", entity->st->min_vdisktime);
+}
+#else /* DEBUG_ELV_FAIR_QUEUING */
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+					unsigned long served, u64 delta) {}
+#endif /* DEBUG_ELV_FAIR_QUEUING */
+
 static void
 entity_served(struct io_entity *entity, unsigned long served,
 				unsigned long nr_sectors)
 {
 	for_each_entity(entity) {
-		entity->vdisktime += elv_delta_fair(served, entity);
+		u64 delta;
+
+		delta = elv_delta_fair(served, entity);
+		entity->vdisktime += delta;
 		update_min_vdisktime(entity->st);
 		entity->total_time += served;
 		entity->total_sectors += nr_sectors;
+		debug_entity_vdisktime(entity, served, delta);
 	}
 }
 
@@ -391,6 +433,9 @@ static void place_entity(struct io_service_tree *st, struct io_entity *entity,
 		vdisktime = st->min_vdisktime;
 
 	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+	elv_log_entity(entity, "place_entity: vdisktime=%llu"
+			" min_vdisktime=%llu", entity->vdisktime,
+			st->min_vdisktime);
 }
 
 static inline void io_entity_update_prio(struct io_entity *entity)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: [PATCH 02/23] io-controller: Core of the elevator fair queuing
       [not found]   ` <1251495072-7780-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-28 22:26     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-28 22:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o This is the core of the IO scheduler, implemented at the elevator layer. It is
>   a mix of the cpu CFS scheduler and the CFQ IO scheduler. Some of the bits from
>   CFS have to be derived so that we can support hierarchical scheduling. Without
>   cgroups, or within a group, we should essentially get the same behavior as CFQ.
> 
> o This patch only shows the non-hierarchical bits. Hierarchical code comes in
>   later patches.
> 
> o This code is the base for introducing fair queuing logic in the common
>   elevator layer so that it can be used by all four IO schedulers.
> 
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 02/23] io-controller: Core of the elevator fair queuing
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-28 22:26     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-28 22:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o This is the core of the IO scheduler, implemented at the elevator layer. It is
>   a mix of the cpu CFS scheduler and the CFQ IO scheduler. Some of the bits from
>   CFS have to be derived so that we can support hierarchical scheduling. Without
>   cgroups, or within a group, we should essentially get the same behavior as CFQ.
> 
> o This patch only shows the non-hierarchical bits. Hierarchical code comes in
>   later patches.
> 
> o This code is the base for introducing fair queuing logic in the common
>   elevator layer so that it can be used by all four IO schedulers.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 02/23] io-controller: Core of the elevator fair queuing
@ 2009-08-28 22:26     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-28 22:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o This is the core of the IO scheduler, implemented at the elevator layer. It is
>   a mix of the cpu CFS scheduler and the CFQ IO scheduler. Some of the bits from
>   CFS have to be derived so that we can support hierarchical scheduling. Without
>   cgroups, or within a group, we should essentially get the same behavior as CFQ.
> 
> o This patch only shows the non-hierarchical bits. Hierarchical code comes in
>   later patches.
> 
> o This code is the base for introducing fair queuing logic in the common
>   elevator layer so that it can be used by all four IO schedulers.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
       [not found]   ` <1251495072-7780-4-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29  1:29     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> This is the common fair queuing code in the elevator layer. It is controlled by
> the config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
> flat fair queuing support, where there is only one group, the "root group", and
> all the tasks belong to the root group.
> 
> These elevator layer changes are backward compatible. That means any ioscheduler
> using the old interfaces will continue to work.
> 
> This is essentially a lot of CFQ logic moved into the common layer so that other
> IO schedulers can make use of it in a hierarchical scheduling setup.
> 
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29  1:29     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This is the common fair queuing code in the elevator layer. It is controlled by
> the config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
> flat fair queuing support, where there is only one group, the "root group", and
> all the tasks belong to the root group.
> 
> These elevator layer changes are backward compatible. That means any ioscheduler
> using the old interfaces will continue to work.
> 
> This is essentially a lot of CFQ logic moved into the common layer so that other
> IO schedulers can make use of it in a hierarchical scheduling setup.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Aristeu Rozanski <aris@redhat.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 03/23] io-controller: Common flat fair queuing code in elevator layer
@ 2009-08-29  1:29     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> This is the common fair queuing code in the elevator layer. It is controlled by
> the config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
> flat fair queuing support, where there is only one group, the "root group", and
> all the tasks belong to the root group.
> 
> These elevator layer changes are backward compatible. That means any ioscheduler
> using the old interfaces will continue to work.
> 
> This is essentially a lot of CFQ logic moved into the common layer so that other
> IO schedulers can make use of it in a hierarchical scheduling setup.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Aristeu Rozanski <aris@redhat.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found]   ` <1251495072-7780-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29  1:44     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> This patch changes cfq to use the fair queuing code from the elevator layer.
> 
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29  1:44     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This patch changes cfq to use the fair queuing code from the elevator layer.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 04/23] io-controller: Modify cfq to make use of flat elevator fair queuing
@ 2009-08-29  1:44     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  1:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> This patch changes cfq to use the fair queuing code from the elevator layer.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
       [not found]   ` <1251495072-7780-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29  3:31     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o This patch introduces core changes in the fair queuing scheduler to support
>   hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
> 
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29  3:31     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o This patch introduces core changes in the fair queuing scheduler to support
>   hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 05/23] io-controller: Core scheduler changes to support hierarchical scheduling
@ 2009-08-29  3:31     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o This patch introduces core changes in the fair queuing scheduler to support
>   hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 06/23] io-controller: cgroup related changes for hierarchical group support
       [not found]   ` <1251495072-7780-7-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29  3:37     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o This patch introduces some of the cgroup-related code for the io controller.
> 
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 06/23] io-controller: cgroup related changes for hierarchical group support
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29  3:37     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o This patch introduces some of the cgroup-related code for the io controller.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 06/23] io-controller: cgroup related changes for hierarchical group support
@ 2009-08-29  3:37     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29  3:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o This patch introduces some of the cgroup-related code for the io controller.
> 
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevator layer
       [not found]   ` <1251495072-7780-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29 23:04     ` Rik van Riel
  2009-09-03  3:08     ` Munehiro Ikeda
  1 sibling, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o This patch enables hierarchical fair queuing in the common layer. It is
>   controlled by the config option CONFIG_GROUP_IOSCHED.

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevator layer
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29 23:04     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o This patch enables hierarchical fair queuing in the common layer. It is
>   controlled by the config option CONFIG_GROUP_IOSCHED.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevator layer
@ 2009-08-29 23:04     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o This patch enables hierarchical fair queuing in the common layer. It is
>   controlled by the config option CONFIG_GROUP_IOSCHED.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 08/23] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
       [not found]   ` <1251495072-7780-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29 23:11     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> Make cfq hierarchical.
> 
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 08/23] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29 23:11     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> Make cfq hierarchical.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Aristeu Rozanski <aris@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 08/23] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
@ 2009-08-29 23:11     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> Make cfq hierarchical.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
> Signed-off-by: Aristeu Rozanski <aris@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 09/23] io-controller: Export disk time used and nr sectors dispatched through cgroups
       [not found]   ` <1251495072-7780-10-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-29 23:12     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o This patch exports some statistics through the cgroup interface. Two of the
>   statistics currently exported are actual disk time assigned to the cgroup
>   and actual number of sectors dispatched to disk on behalf of this cgroup.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 09/23] io-controller: Export disk time used and nr sectors dispatched through cgroups
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-29 23:12     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o This patch exports some statistics through the cgroup interface. Two of the
>   statistics currently exported are actual disk time assigned to the cgroup
>   and actual number of sectors dispatched to disk on behalf of this cgroup.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 09/23] io-controller: Export disk time used and nr sectors dispatched through cgroups
@ 2009-08-29 23:12     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-29 23:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o This patch exports some statistics through the cgroup interface. Two of the
>   statistics currently exported are actual disk time assigned to the cgroup
>   and actual number of sectors dispatched to disk on behalf of this cgroup.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 10/23] io-controller: Debug hierarchical IO scheduling
       [not found]   ` <1251495072-7780-11-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-30  0:10     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o Little debugging aid for hierarchical IO scheduling.
> 
> o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
> 
> o Currently it outputs more debug messages in blktrace output which helps
>   a great deal in debugging in hierarchical setup. It also creates additional
>   cgroup interfaces io.disk_queue and io.disk_dequeue to output some more
>   debugging data.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread
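
[Editorial sketch] The debug patch quoted above creates two extra cgroup files,
io.disk_queue and io.disk_dequeue, alongside the blktrace messages. As a rough
usage illustration, here is a small standalone C program that dumps them. Only
the two file names come from the patch description; the /cgroup mount point and
the "test1" group name are assumptions for illustration.

/* Hedged example: dump the per-group debug statistics files created by
 * CONFIG_DEBUG_GROUP_IOSCHED for one cgroup. */
#include <stdio.h>

static void dump_file(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	/* Assumed paths: io controller mounted at /cgroup, group "test1". */
	dump_file("/cgroup/test1/io.disk_queue");
	dump_file("/cgroup/test1/io.disk_dequeue");
	return 0;
}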

* Re: [PATCH 10/23] io-controller: Debug hierarchical IO scheduling
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-08-30  0:10     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o Little debugging aid for hierarchical IO scheduling.
> 
> o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
> 
> o Currently it outputs more debug messages in blktrace output which helps
>   a great deal in debugging in hierarchical setup. It also creates additional
>   cgroup interfaces io.disk_queue and io.disk_dequeue to output some more
>   debugging data.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 10/23] io-controller: Debug hierarchical IO scheduling
@ 2009-08-30  0:10     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o Little debugging aid for hierarchical IO scheduling.
> 
> o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
> 
> o Currently it outputs more debug messages in blktrace output which helps
>   a great deal in debugging in hierarchical setup. It also creates additional
>   cgroup interfaces io.disk_queue and io.disk_dequeue to output some more
>   debugging data.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 11/23] io-controller: Introduce group idling
       [not found]   ` <1251495072-7780-12-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-30  0:38     ` Rik van Riel
  2009-09-18  3:56     ` [PATCH] io-controller: Fix another bug that causing system hanging Gui Jianfeng
  1 sibling, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o IO from a process or group is not always continuous. There are cases of
>   dependent reads where the next read is not issued till the previous read
>   has finished. For such cases, CFQ introduced the notion of slice_idle,
>   where we idle on the queue for some time hoping the next request will come;
>   that is how fairness is provided. Otherwise the queue would be deleted
>   immediately from the service tree and this process would not get its
>   fair share.
> 
> o This patch introduces a similar concept at the group level. We idle on the
>   group for a period of "group_idle", which is tunable through the sysfs
>   interface. So if a group is empty and about to be deleted, we idle for the
>   next request.
> 
> o This patch also introduces the notion of wait busy, where we wait for one
>   extra group_idle period even if the queue has consumed its time slice. The
>   reason is that the group would lose its share upon removal from the service
>   tree, as some other entity will be picked for dispatch and a vtime jump will
>   take place.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread
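
[Editorial sketch] The group idling described in the patch quoted above boils
down to a small decision at group expiry time. The standalone C sketch below
illustrates that control flow only; the field and function names are
placeholders, not the patch's symbols, and the 8 ms default is an assumption.

/* Standalone sketch of the group-idling decision, not the patch's code.
 * When the active group runs out of queued requests (or its queue's slice
 * expires, the "wait busy" case), keep it on the service tree and arm an
 * idle window of group_idle instead of expiring it immediately, so a
 * dependent request arriving shortly does not cost the group its share. */
#include <stdbool.h>
#include <stdio.h>

struct sketch_group {
	int nr_queued;           /* requests still queued in the group      */
	bool slice_expired;      /* active queue consumed its time slice    */
	unsigned int group_idle; /* tunable idle window in ms; 0 disables   */
};

static bool sketch_should_idle_on_group(const struct sketch_group *g)
{
	if (!g->group_idle)
		return false;    /* idling disabled via the tunable         */
	if (g->nr_queued > 0)
		return false;    /* group still has work; just dispatch it  */
	/* Empty group about to be removed: idle (or "wait busy" when the
	 * slice already expired) for one group_idle period. */
	return true;
}

int main(void)
{
	struct sketch_group g = {
		.nr_queued = 0, .slice_expired = true, .group_idle = 8,
	};

	printf("idle on group: %s\n",
	       sketch_should_idle_on_group(&g) ? "yes" : "no");
	return 0;
}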

* Re: [PATCH 11/23] io-controller: Introduce group idling
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-30  0:38     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o IO from a process or group is not always continuous. There are cases of
>   dependent reads where the next read is not issued till the previous read
>   has finished. For such cases, CFQ introduced the notion of slice_idle,
>   where we idle on the queue for some time hoping the next request will come;
>   that is how fairness is provided. Otherwise the queue would be deleted
>   immediately from the service tree and this process would not get its
>   fair share.
> 
> o This patch introduces a similar concept at the group level. We idle on the
>   group for a period of "group_idle", which is tunable through the sysfs
>   interface. So if a group is empty and about to be deleted, we idle for the
>   next request.
> 
> o This patch also introduces the notion of wait busy, where we wait for one
>   extra group_idle period even if the queue has consumed its time slice. The
>   reason is that the group would lose its share upon removal from the service
>   tree, as some other entity will be picked for dispatch and a vtime jump will
>   take place.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 11/23] io-controller: Introduce group idling
@ 2009-08-30  0:38     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o IO from a process or group is not always continuous. There are cases of
>   dependent reads where the next read is not issued till the previous read
>   has finished. For such cases, CFQ introduced the notion of slice_idle,
>   where we idle on the queue for some time hoping the next request will come;
>   that is how fairness is provided. Otherwise the queue would be deleted
>   immediately from the service tree and this process would not get its
>   fair share.
> 
> o This patch introduces a similar concept at the group level. We idle on the
>   group for a period of "group_idle", which is tunable through the sysfs
>   interface. So if a group is empty and about to be deleted, we idle for the
>   next request.
> 
> o This patch also introduces the notion of wait busy, where we wait for one
>   extra group_idle period even if the queue has consumed its time slice. The
>   reason is that the group would lose its share upon removal from the service
>   tree, as some other entity will be picked for dispatch and a vtime jump will
>   take place.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled
       [not found]   ` <1251495072-7780-13-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-30  0:40     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:

> o Because the above behavior can result in reduced throughput, it will be
>   enabled only if the user sets the "fairness" tunable to 1.
> 
> o This patch helps in achieving more isolation between reads and buffered
>   writes in different cgroups. 

Could be a nice help against SATA NCQ request starvation.

> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread
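
[Editorial sketch] A minimal illustration of the gate the "fairness" tunable
enables, as described in the patch quoted above: before dispatching from the
newly scheduled queue, requests still in flight from the previously active
queue are drained first. All names below are illustrative placeholders, not
the patch's symbols.

/* Standalone sketch, not the patch's code.  With fairness == 0 the
 * dispatcher favours throughput and starts the new queue immediately;
 * with fairness == 1 it waits until the previous queue's in-flight
 * requests have completed, which improves isolation between groups. */
#include <stdbool.h>
#include <stdio.h>

static bool sketch_may_dispatch_new_queue(int prev_in_flight, int fairness)
{
	if (!fairness)
		return true;             /* default: favour throughput      */
	return prev_in_flight == 0;      /* fairness=1: drain old requests  */
}

int main(void)
{
	printf("fairness=0, 3 in flight -> dispatch: %d\n",
	       sketch_may_dispatch_new_queue(3, 0));
	printf("fairness=1, 3 in flight -> dispatch: %d\n",
	       sketch_may_dispatch_new_queue(3, 1));
	printf("fairness=1, 0 in flight -> dispatch: %d\n",
	       sketch_may_dispatch_new_queue(0, 1));
	return 0;
}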

* Re: [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-30  0:40     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:

> o Because the above behavior can result in reduced throughput, it will be
>   enabled only if the user sets the "fairness" tunable to 1.
> 
> o This patch helps in achieving more isolation between reads and buffered
>   writes in different cgroups. 

Could be a nice help against SATA NCQ request starvation.

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 12/23] io-controller: Wait for requests to complete from last queue before new queue is scheduled
@ 2009-08-30  0:40     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-30  0:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:

> o Because the above behavior can result in reduced throughput, it will be
>   enabled only if the user sets the "fairness" tunable to 1.
> 
> o This patch helps in achieving more isolation between reads and buffered
>   writes in different cgroups. 

Could be a nice help against SATA NCQ request starvation.

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (22 preceding siblings ...)
  2009-08-28 21:31   ` [PATCH 23/23] io-controller: debug elevator fair queuing support Vivek Goyal
@ 2009-08-31  1:09   ` Gui Jianfeng
  2009-09-02  0:58   ` Gui Jianfeng
                     ` (6 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-08-31  1:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 
> Changes from V8
> ===============
> - Implemented bdi like congestion semantics for io group also. Now once an
>   io group gets congested, we don't clear the congestion flag until number
>   of requests goes below nr_congestion_off.
> 
>   This helps in getting rid of Buffered write performance regression we
>   were observing with io controller patches.
> 
>   Gui, can you please test it and see if this version is better in terms
>   of your buffered write tests.

Ok, will do.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-08-31  1:09   ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-08-31  1:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 
> Changes from V8
> ===============
> - Implemented bdi like congestion semantics for io group also. Now once an
>   io group gets congested, we don't clear the congestion flag until number
>   of requests goes below nr_congestion_off.
> 
>   This helps in getting rid of Buffered write performance regression we
>   were observing with io controller patches.
> 
>   Gui, can you please test it and see if this version is better in terms
>   of your buffered write tests.

Ok, will do.

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-08-31  1:09   ` Gui Jianfeng
  0 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-08-31  1:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 
> Changes from V8
> ===============
> - Implemented bdi like congestion semantics for io group also. Now once an
>   io group gets congested, we don't clear the congestion flag until number
>   of requests goes below nr_congestion_off.
> 
>   This helps in getting rid of Buffered write performance regression we
>   were observing with io controller patches.
> 
>   Gui, can you please test it and see if this version is better in terms
>   of your buffered write tests.

Ok, will do.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 14/23] io-controller: Prepare elevator layer for single queue schedulers
       [not found]   ` <1251495072-7780-15-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31  2:49     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31  2:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> Elevator layer now has support for hierarchical fair queuing. cfq has
> been migrated to make use of it and now it is time to do groundwork for
> noop, deadline and AS.
> 
> noop deadline and AS don't maintain separate queues for different processes.
> There is only one single queue. Effectively one can think that in hierarchical
> setup, there will be one queue per cgroup where requests from all the
> processes in the cgroup will be queued.
> 
> Generally io scheduler takes care of creating queues. Because there is
> only one queue here, we have modified common layer to take care of queue
> creation and some other functionality. This special casing helps in keeping
> the changes to noop, deadline and AS to the minimum.
> 
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 14/23] io-controller: Prepare elevator layer for single queue schedulers
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31  2:49     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31  2:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> The elevator layer now has support for hierarchical fair queuing. cfq has
> been migrated to make use of it and now it is time to do the groundwork for
> noop, deadline and AS.
> 
> noop, deadline and AS don't maintain separate queues for different processes.
> There is only a single queue. Effectively one can think that in a hierarchical
> setup, there will be one queue per cgroup where requests from all the
> processes in the cgroup will be queued.
> 
> Generally the io scheduler takes care of creating queues. Because there is
> only one queue here, we have modified the common layer to take care of queue
> creation and some other functionality. This special casing helps in keeping
> the changes to noop, deadline and AS to a minimum.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread
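
To illustrate the "one queue per cgroup" idea described in the patch above,
here is a minimal userspace C sketch. All names here (io_group, ioq,
pick_group) are made up for illustration and are not the actual patch API;
the real common layer selects the group with virtual-time based fair queuing
rather than this toy weight comparison.

/*
 * Hypothetical userspace sketch of "one queue per cgroup" for a
 * single-queue scheduler.  Not the actual patch code.
 */
#include <stdio.h>

#define MAX_GROUPS 4

struct ioq {                    /* the one scheduler queue of a group */
        int pending;            /* queued requests */
};

struct io_group {
        int weight;
        struct ioq q;           /* noop/deadline/AS only ever see this */
};

static struct io_group groups[MAX_GROUPS] = {
        { .weight = 100, .q = { .pending = 3 } },
        { .weight = 200, .q = { .pending = 5 } },
};

/* Common (elevator) layer: choose which group's queue runs next. */
static struct io_group *pick_group(void)
{
        struct io_group *best = NULL;
        int i;

        for (i = 0; i < MAX_GROUPS; i++)
                if (groups[i].q.pending &&
                    (!best || groups[i].weight > best->weight))
                        best = &groups[i];
        return best;
}

int main(void)
{
        struct io_group *g;

        while ((g = pick_group()) != NULL) {
                /* Scheduler-specific code would now pick one request
                 * from g->q (FIFO for noop, sorted for deadline). */
                g->q.pending--;
                printf("dispatch from group of weight %d\n", g->weight);
        }
        return 0;
}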

* Re: [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31  2:52     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31  2:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This patch changes noop to use queue scheduling code from elevator layer.
> One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Not sure why noop needs hierarchical fair queueing
support, but this patch is so small we might as well
take it to keep things consistent between schedulers.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 16/23] io-controller: deadline changes for hierarchical fair queuing
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31  3:13     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31  3:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This patch changes deadline to use queue scheduling code from elevator layer.
> One can go back to old deadline by selecting CONFIG_IOSCHED_DEADLINE_HIER.
                                     ^ deselecting ?

> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 16/23] io-controller: deadline changes for hierarchical fair queuing
  2009-08-31  3:13     ` Rik van Riel
@ 2009-08-31 13:46       ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 13:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

On Sun, Aug 30, 2009 at 11:13:23PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> This patch changes deadline to use queue scheduling code from elevator layer.
>> One can go back to old deadline by selecting CONFIG_IOSCHED_DEADLINE_HIER.
>                                     ^ deselecting ?
>

Yes. It's a typo. Should be "deselecting". Will fix the comment.

Thanks
Vivek

>> Signed-off-by: Nauman Rafique <nauman@google.com>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> Acked-by: Rik van Riel <riel@redhat.com>
>
> -- 
> All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 13/23] io-controller: Separate out queue and data
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 15:27     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 15:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o So far noop, deadline and AS had one common structure called *_data which
>   contained both the queue information where requests are queued and also
>   common data used for scheduling. This patch breaks this common structure
>   down into two parts, *_queue and *_data. This is along the lines of
>   cfq, where all the requests are queued in the queue and the common data
>   and tunables are part of the data.
> 
> o It does not change the functionality, but this re-organization helps once
>   noop, deadline and AS are changed to use hierarchical fair queuing.
> 
> o Looks like the queue_empty function is not required; we can check
>   q->nr_sorted in the elevator layer to see if the ioscheduler queues are
>   empty or not.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 322+ messages in thread
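
A rough sketch of the *_data -> *_queue / *_data split described in the patch
above, using deadline-like names purely as an illustration; the field layout
is hypothetical, not the actual patch code.

/*
 * Illustration of splitting one *_data structure into *_queue (where
 * requests live, potentially one per cgroup) and *_data (tunables and
 * bookkeeping, one per device).
 */
#include <string.h>

struct rb_root   { void *rb_node; };
struct list_head { struct list_head *next, *prev; };

/* Per-queue part: with hierarchical scheduling, one per group. */
struct deadline_queue {
        struct rb_root   sort_list[2];  /* requests sorted by sector, R/W */
        struct list_head fifo_list[2];  /* requests sorted by expiry, R/W */
};

/* Per-scheduler part: stays one per device. */
struct deadline_data {
        int fifo_expire[2];             /* read/write deadlines (ms) */
        int writes_starved;
        int front_merges;
};

int main(void)
{
        struct deadline_queue q;        /* one such queue per cgroup */
        struct deadline_data d = { { 500, 5000 }, 2, 1 };

        memset(&q, 0, sizeof(q));
        (void)d;
        return 0;
}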

* Re: [PATCH 17/23] io-controller: anticipatory changes for hierarchical fair queuing
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 17:21     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 17:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This patch changes the anticipatory scheduler to use the queue scheduling
> code from the elevator layer.  One can go back to the old AS by deselecting
> CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, without any
> other cgroup created, AS behavior should remain the same as before.
> 
> o AS is a single-queue ioscheduler, which means there is one AS queue per group.
> 
> o Common layer code selects the queue to dispatch from based on fairness, and
>   then AS code selects the request within the group.
> 
> o AS runs read and write batches within the group. So the common layer runs
>   timed group queues and, within the group's time, AS runs timed batches of
>   reads and writes.
> 
> o Note: Previously the AS write batch length was adjusted dynamically whenever
>   a W->R batch direction switch took place and when the first request from the
>   read batch completed.
> 
>   Now the write batch update takes place when the last request from the write
>   batch has finished during the W->R transition.
> 
> o AS runs its own anticipation logic to anticipate on reads. The common layer
>   also anticipates on the group if the think time of the group is within
>   slice_idle.
> 
> o Introduced a few debugging messages in AS.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 322+ messages in thread
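
A toy userspace sketch of the two-level time slicing described above: the
common layer accounts each group's slice, and within that slice AS alternates
timed read and write batches. The structure, names and numbers are invented
for illustration and do not reflect the actual patch code.

/* Toy model of group slices (common layer) plus per-group batches (AS). */
#include <stdio.h>

enum batch_dir { READ_BATCH, WRITE_BATCH };

struct as_group {
        const char *name;
        int slice_left;          /* group slice left (common layer) */
        int batch_left;          /* current batch left (AS) */
        enum batch_dir batch;
};

static void dispatch_one(struct as_group *g)
{
        if (g->batch_left == 0) {        /* AS switches batch direction */
                g->batch = (g->batch == READ_BATCH) ? WRITE_BATCH : READ_BATCH;
                g->batch_left = (g->batch == READ_BATCH) ? 8 : 4;
        }
        printf("%s: dispatch from %s batch\n", g->name,
               g->batch == READ_BATCH ? "read" : "write");
        g->batch_left--;
        g->slice_left--;                 /* common layer charges group time */
}

int main(void)
{
        struct as_group a = { "groupA", 10, 0, WRITE_BATCH };
        struct as_group b = { "groupB",  5, 0, WRITE_BATCH };

        /* Common layer runs one group's slice at a time; AS code only
         * ever sees its own group's queue and batches. */
        while (a.slice_left)
                dispatch_one(&a);
        while (b.slice_left)
                dispatch_one(&b);
        return 0;
}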

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 17:34     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 17:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o blkio_cgroup patches from Ryo to track async bios.
> 
> o This functionality is used to determine the group of async IO from the page
>   instead of from the context of the submitting task.
> 
> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

This seems to be the most complex part of the code so far,
but I see why this code is necessary.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 322+ messages in thread
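
A minimal userspace sketch of the idea behind the blkio_cgroup tracking
described above: record which cgroup dirtied a page, so that async writeback
issued later by another task can still be charged to the original owner. The
page layout, ids and helpers below are hypothetical stand-ins, not the
patch's API.

/* Charge async writeback to the page owner, sync IO to the submitter. */
#include <stdio.h>

struct page {
        unsigned short blkio_id;        /* owner cgroup id, set at dirty time */
};

static unsigned short current_cgroup = 1;  /* pretend "current" is in cgroup 1 */

static void dirty_page(struct page *pg)
{
        pg->blkio_id = current_cgroup;  /* owner recorded when page is dirtied */
}

static unsigned short bio_cgroup(const struct page *pg, int sync)
{
        /* sync IO charged to submitter, async IO to the page owner */
        return sync ? current_cgroup : pg->blkio_id;
}

int main(void)
{
        struct page pg;

        dirty_page(&pg);          /* a task in cgroup 1 dirties the page */
        current_cgroup = 2;       /* later, writeback runs from cgroup 2 */

        printf("async writeback charged to cgroup %u\n",
               (unsigned)bio_cgroup(&pg, 0));
        printf("sync read charged to cgroup %u\n",
               (unsigned)bio_cgroup(&pg, 1));
        return 0;
}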

* Re: [PATCH 19/23] io-controller: map async requests to appropriate cgroup
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 17:39     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 17:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o So far we were assuming that a bio/rq belongs to the task that is submitting
>   it. That does not hold good in the case of async writes. This patch makes
>   use of the blkio_cgroup patches to attribute the async writes to the right
>   group instead of to the task submitting the bio.
> 
> o For sync requests, we continue to assume that the io belongs to the task
>   submitting it. Only in the case of async requests do we make use of the io
>   tracking patches to track the owner cgroup.
> 
> o So far cfq always caches the async queue pointer. With async requests now
>   not necessarily being tied to the submitting task's io context, caching the
>   pointer will not help for async queues. This patch introduces a new config
>   option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
>   the old behavior where the async queue pointer is cached in the task
>   context. If it is set, the async queue pointer is not cached and we use the
>   bio tracking patches to determine the group a bio belongs to and then map
>   it to the async queue of that group.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 322+ messages in thread
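
A small sketch of the two async-queue lookup strategies described above:
without tracking, the async queue pointer is cached in the submitting task's
io context; with tracking enabled, it is looked up from the group that owns
the page being written. Everything below (the macro standing in for
CONFIG_TRACK_ASYNC_CONTEXT, the types, the helper) is a made-up illustration,
not the cfq code.

/* Two ways of resolving the async queue for a bio. */
#include <stdio.h>

#define TRACK_ASYNC_CONTEXT 1    /* stand-in for CONFIG_TRACK_ASYNC_CONTEXT */

struct async_queue { int group_id; };
struct io_group    { struct async_queue async; };
struct io_context  { struct async_queue *cached_async; };

static struct async_queue *lookup_async_queue(struct io_context *ioc,
                                              struct io_group *page_owner)
{
#if TRACK_ASYNC_CONTEXT
        (void)ioc;
        return &page_owner->async;   /* charge the group owning the page */
#else
        (void)page_owner;
        return ioc->cached_async;    /* old behavior: cached per task */
#endif
}

int main(void)
{
        struct io_group g1 = { { 1 } };
        struct io_context ioc = { &g1.async };

        printf("async bio mapped to group %d\n",
               lookup_async_queue(&ioc, &g1)->group_id);
        return 0;
}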

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 17:54     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 17:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o Currently a request queue has a fixed number of request descriptors for
>   sync and async requests. Once the request descriptors are consumed, new
>   processes are put to sleep and they effectively become serialized. Because
>   sync and async queues are separate, async requests don't impact sync ones,
>   but if one is looking for fairness between async requests, that is not
>   achievable if request queue descriptors become the bottleneck.
> 
> o Make request descriptors per io group so that if there is lots of IO
>   going on in one cgroup, it does not impact the IO of other groups.
> 
> o This patch implements the per cgroup request descriptors. The request pool
>   per queue is still common, but every group will have its own wait list and
>   its own count of request descriptors allocated to that group for sync and
>   async queues. So effectively request_list becomes a per io group property
>   and not a global request queue feature.
> 
> o Currently one can define q->nr_requests to limit the request descriptors
>   allocated for the queue. Now there is another tunable, q->nr_group_requests,
>   which controls the request descriptor limit per group. q->nr_requests
>   supersedes q->nr_group_requests to make sure that if there are lots of
>   groups present, we don't end up allocating too many request descriptors on
>   the queue.
> 
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread
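
A rough sketch of the per-group request descriptor accounting described
above: a queue-wide limit (nr_requests) still applies, but each group also
has its own limit (nr_group_requests) and blocks independently when it is
exhausted. Field and function names are illustrative only, not the patch
code.

/* Per-group descriptor limit on top of the queue-wide limit. */
#include <stdbool.h>
#include <stdio.h>

struct request_list {            /* now effectively per io group */
        int count[2];            /* allocated async/sync descriptors */
        int nr_group_requests;   /* per-group limit */
};

struct request_queue {
        int nr_allocated;        /* total across all groups */
        int nr_requests;         /* queue-wide limit */
};

static bool may_allocate(struct request_queue *q,
                         struct request_list *rl, int sync)
{
        if (q->nr_allocated >= q->nr_requests)
                return false;    /* queue-wide pool exhausted */
        if (rl->count[sync] >= rl->nr_group_requests)
                return false;    /* this group sleeps, others unaffected */
        return true;
}

int main(void)
{
        struct request_queue q = { .nr_allocated = 0, .nr_requests = 128 };
        struct request_list grp = { .count = { 0, 0 }, .nr_group_requests = 16 };

        while (may_allocate(&q, &grp, 1)) {
                q.nr_allocated++;
                grp.count[1]++;
        }
        printf("group blocked after %d sync requests\n", grp.count[1]);
        return 0;
}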

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-08-31 17:34     ` Rik van Riel
@ 2009-08-31 18:56       ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 18:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

On Mon, Aug 31, 2009 at 01:34:54PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> o blkio_cgroup patches from Ryo to track async bios.
>>
>> o This functionality is used to determine the group of async IO from page
>>   instead of context of submitting task.
>>
>> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
>> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> This seems to be the most complex part of the code so far,
> but I see why this code is necessary.
>

Hi Rik,

Thanks for reviewing the patches. I wanted to get a better understanding of
where it actually helps to associate a bio with the group of the process that
created/owned the page. Hence a few thoughts.

When a bio is submitted to the IO scheduler, it needs to determine the group
the bio belongs to and the group which should be charged. There seem to be two
methods.

- Attribute the bio to the cgroup the submitting process belongs to.
- For async requests, track the original owner, hence the cgroup, of the page
  and charge that group for the bio.

One can think of pros/cons of both approaches.

- The primary use case of tracking the async context seems to be that if a
  process T1 in group G1 mmaps a big file and then another process T2 in
  group G2 asks for memory, triggers reclaim and generates writes of
  the file pages mapped by T1, then these writes should not be charged to
  T2; hence the blkio_cgroup pages.

  But the flip side of this might be that group G2 is a low weight group
  and probably also too busy right now, which will delay the writeout
  and possibly T2 will wait longer for memory to be allocated.

- At one point Andrew mentioned that buffered writes are generally a
  big problem and one needs to map these to the owner's group, though I am not
  very sure what specific problem he was referring to. Can we attribute
  buffered writes to the pdflush threads and move all pdflush threads into a
  cgroup to limit system-wide writeout activity?

- Somebody also gave an example where there is a memory hogging process that
  possibly pushes out some processes to swap. It does not sound fair to
  charge those processes for that swap writeout. These processes never
  requested swap IO.

- If there are multiple buffered writers in the system, then those writers
  can also be forced to write out some pages to disk before they are
  allowed to dirty more pages. As per the page cache design, any writer
  can pick any inode and start writing out pages. So it can happen that a
  higher weight group task is writing out pages dirtied by a lower weight
  group task. If an async bio is mapped to the owner's group, a higher weight
  group task might be made to sleep behind a lower weight group task because
  the request descriptors are all consumed.

It looks like there is no clean way which covers all the cases without
issues. I am just trying to think of a simple way which covers most of the
cases. Can we just stick to using the submitting task's context to determine
a bio's group (as cfq does)? That would result in the following.

- Less code and reduced complexity.

- Buffered writes will be charged to pdflush and its group. If one wishes to
  limit buffered write activity for pdflush, one can move all the pdflush
  threads into a group and assign it the desired weight. Writes submitted in
  process context will continue to be charged to that process irrespective
  of who dirtied the page.

- Swap activity will be charged to kswapd and its group. If swap writes
  are coming from process context, they get charged to that process and its
  group.

- If one is worried about the case of one process being charged for the
  writeout of a file mapped by another process during reclaim, then we can
  probably make use of the memory controller and mount the memory controller
  and the io controller together on the same hierarchy. I am told that with
  the memory controller, a group's memory will be reclaimed by the process
  requesting more memory. If that's the case, then IO will automatically be
  charged to the right group if we use the submitting task's context.

I just wanted to bring this point forward for more discussion, to figure out
what the right thing to do is: use bio tracking or not.

Ryo, any thoughts on this?

Thanks
Vivek

> Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-08-31 18:56       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 18:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Mon, Aug 31, 2009 at 01:34:54PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> o blkio_cgroup patches from Ryo to track async bios.
>>
>> o This functionality is used to determine the group of async IO from page
>>   instead of context of submitting task.
>>
>> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
>> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> This seems to be the most complex part of the code so far,
> but I see why this code is necessary.
>

Hi Rik,

Thanks for reviewing the patches. I wanted to have better understanding of
where all does it help to associate a bio to the group of process who
created/owned the page. Hence few thoughts.

When a bio is submitted to IO scheduler, it needs to determine the group
bio belongs to and group which should be charged to. There seem to be two 
methods.

- Attribute the bio to cgroup submitting process belongs to.
- For async requests, track the original owner hence cgroup of the page
  and charge that group for the bio. 

One can think of pros/cons of both the approaches.

- The primary use case of tracking async context seems be that if a
  process T1 in group G1 mmaps a big file and then another process T2 in
  group G2, asks for memory and triggers reclaim and generates writes of
  the file pages mapped by T1, then these writes should not be charged to
  T2, hence blkio_cgroup pages.

  But the flip side of this might be that group G2 is a low weight group
  and probably too busy also right now, which will delay the write out
  and possibly T2 will wait longer for memory to be allocated.

- At one point of time Andrew mentioned that buffered writes are generally a
  big problem and one needs to map these to owner's group. Though I am not
  very sure what specific problem he was referring to. Can we attribute
  buffered writes to pdflush threads and move all pdflush threads in a 
  cgroup to limit system wide write out activity?

- Somebody also gave an example where there is a memory hogging process and
  possibly pushes out some processes to swap. It does not sound fair to 
  charge those proccess for that swap writeout. These processes never
  requested swap IO.

- If there are multiple buffered writers in the system, then those writers
  can also be forced to writeout some pages to disk before they are
  allowed to dirty more pages. As per the page cache design, any writer
  can pick any inode and start writing out pages. So it can happen a
  weight group task is writting out pages dirtied by a lower weight group
  task. If, async bio is mapped to owner's group, it might happen that
  higher weight group task might be made to sleep on lower weight group
  task because request descriptors are all consumed up.

It looks like there is no clean way which covers all the cases without
issues. I am just trying to think what a simple way would be which covers
most of the cases. Can we just stick to using the submitting task's context
to determine a bio's group (as CFQ does)? That can result in the following.

- Less code and reduced complexity.

- Buffered writes will be charged to pdflush and its group. If one wishes to
  limit buffered write activity for pdflush, one can move all the pdflush
  threads into a group and assign the desired weight (see the sketch after
  this list). Writes submitted in process context will continue to be
  charged to that process irrespective of who dirtied the page.

- Swap activity will be charged to kswapd and its group. If swap writes
  are coming from process context, they get charged to the process and its
  group.

- If one is worried about the case of one process being charged for the
  writeout of a file mapped by another process during reclaim, then we can
  probably make use of the memory controller and mount the memory controller
  and the io controller together on the same hierarchy. I am told that with
  the memory controller, a group's memory will be reclaimed by the process
  requesting more memory. If that's the case, then IO will automatically be
  charged to the right group if we use the submitting task's context.
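
As an illustration of the pdflush idea above, here is a minimal sketch
(userspace C) that puts a set of PIDs into an io-controller cgroup and
assigns it a weight. The "io.weight" file name and the cgroup mount point in
the usage example are assumptions about how the controller would be mounted
and named, not something taken from these patches; the "tasks" file is the
standard cgroup interface.

#include <stdio.h>

static int write_str(const char *dir, const char *file, const char *val)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, file);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(int argc, char **argv)
{
	int i;

	if (argc < 4) {
		fprintf(stderr, "usage: %s <cgroup-dir> <weight> <pid>...\n", argv[0]);
		return 1;
	}
	/* "io.weight" is an assumed name for the per-group weight file */
	if (write_str(argv[1], "io.weight", argv[2]))
		return 1;
	for (i = 3; i < argc; i++)
		if (write_str(argv[1], "tasks", argv[i]))	/* standard cgroup tasks file */
			return 1;
	return 0;
}

One could then run it as, for example,
"./iogroup /cgroup/io/writeback 300 $(pgrep pdflush)" after creating that
cgroup directory.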

I just wanted to bring this point forward for more discussion, to figure out
what the right thing to do is: use bio tracking or not.

Ryo, any thoughts on this?

Thanks
Vivek

> Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 21/23] io-controller: Per io group bdi congestion interface
       [not found]   ` <1251495072-7780-22-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 19:49     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o So far there used to be only one pair or queue  of request descriptors
>   (one for sync and one for async) per device and number of requests allocated
>   used to decide whether associated bdi is congested or not.

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 21/23] io-controller: Per io group bdi congestion interface
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 19:49     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o So far there used to be only one pair or queue  of request descriptors
>   (one for sync and one for async) per device and number of requests allocated
>   used to decide whether associated bdi is congested or not.

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 21/23] io-controller: Per io group bdi congestion interface
@ 2009-08-31 19:49     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o So far there used to be only one pair or queue  of request descriptors
>   (one for sync and one for async) per device and number of requests allocated
>   used to decide whether associated bdi is congested or not.

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
       [not found]   ` <1251495072-7780-23-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 20:56     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 20:56     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 22/23] io-controller: Support per cgroup per device weights and io class
@ 2009-08-31 20:56     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
       [not found]   ` <1251495072-7780-24-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 20:57     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> o More debugging help to debug elevator fair queuing support. Enabled under
>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>   trace messages in blktrace.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Is this meant for merging upstream, or just as a temporary
debugging help while this sits in a subsystem tree or -mm?

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-08-31 20:57     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o More debugging help to debug elevator fair queuing support. Enabled under
>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>   trace messages in blktrace.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Is this meant for merging upstream, or just as a temporary
debugging help while this sits in a subsystem tree or -mm?


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
@ 2009-08-31 20:57     ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 20:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> o More debugging help to debug elevator fair queuing support. Enabled under
>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>   trace messages in blktrace.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Is this meant for merging upstream, or just as a temporary
debugging help while this sits in a subsystem tree or -mm?

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
       [not found]     ` <4A9C3951.8020302-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 21:01       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 21:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> o More debugging help to debug elevator fair queuing support. Enabled under
>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>   trace messages in blktrace.
>>
>> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> Is this meant for merging upstream, or just as a temporary
> debugging help while this sits in a subsystem tree or -mm?

I think it would be good if this is also merged upstream. It is a useful
debugging aid for tracking fairness and latency related issues.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
  2009-08-31 20:57     ` Rik van Riel
@ 2009-08-31 21:01       ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 21:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> o More debugging help to debug elevator fair queuing support. Enabled under
>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>   trace messages in blktrace.
>>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> Is this meant for merging upstream, or just as a temporary
> debugging help while this sits in a subsystem tree or -mm?

I think it would be good if this is also merged upstream. It is a useful
debugging aid for tracking fairness and latency related issues.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
@ 2009-08-31 21:01       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-31 21:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> o More debugging help to debug elevator fair queuing support. Enabled under
>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>   trace messages in blktrace.
>>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> Is this meant for merging upstream, or just as a temporary
> debugging help while this sits in a subsystem tree or -mm?

I think it would be good if this is also merged upstream. It is a useful
debugging aid for tracking fairness and latency related issues.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
       [not found]       ` <20090831210154.GA8229-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 21:12         ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 21:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o More debugging help to debug elevator fair queuing support. Enabled under
>>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>>   trace messages in blktrace.
>>>
>>> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Is this meant for merging upstream, or just as a temporary
>> debugging help while this sits in a subsystem tree or -mm?
> 
> I think it would be good if this also is merged upstream. A useful
> debugging help to track fairness and latecy related issues.

Fair enough.  The code is small enough, anyway.

Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
  2009-08-31 21:01       ` Vivek Goyal
@ 2009-08-31 21:12         ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 21:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o More debugging help to debug elevator fair queuing support. Enabled under
>>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>>   trace messages in blktrace.
>>>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>> Is this meant for merging upstream, or just as a temporary
>> debugging help while this sits in a subsystem tree or -mm?
> 
> I think it would be good if this also is merged upstream. A useful
> debugging help to track fairness and latecy related issues.

Fair enough.  The code is small enough, anyway.

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 23/23] io-controller: debug elevator fair queuing support
@ 2009-08-31 21:12         ` Rik van Riel
  0 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-08-31 21:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> On Mon, Aug 31, 2009 at 04:57:53PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o More debugging help to debug elevator fair queuing support. Enabled under
>>>   CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it prints vdisktime related
>>>   trace messages in blktrace.
>>>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>> Is this meant for merging upstream, or just as a temporary
>> debugging help while this sits in a subsystem tree or -mm?
> 
> I think it would be good if this also is merged upstream. A useful
> debugging help to track fairness and latecy related issues.

Fair enough.  The code is small enough, anyway.

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]       ` <20090831185640.GF3758-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-31 23:51         ` Nauman Rafique
  0 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-08-31 23:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, Rik van Riel,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Aug 31, 2009 at 11:56 AM, Vivek Goyal<vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Aug 31, 2009 at 01:34:54PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o blkio_cgroup patches from Ryo to track async bios.
>>>
>>> o This functionality is used to determine the group of async IO from page
>>>   instead of context of submitting task.
>>>
>>> Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
>>> Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
>>> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> This seems to be the most complex part of the code so far,
>> but I see why this code is necessary.
>>
>
> Hi Rik,
>
> Thanks for reviewing the patches. I wanted to have better understanding of
> where all does it help to associate a bio to the group of process who
> created/owned the page. Hence few thoughts.
>
> When a bio is submitted to IO scheduler, it needs to determine the group
> bio belongs to and group which should be charged to. There seem to be two
> methods.
>
> - Attribute the bio to cgroup submitting process belongs to.
> - For async requests, track the original owner hence cgroup of the page
>  and charge that group for the bio.
>
> One can think of pros/cons of both the approaches.
>
> - The primary use case of tracking async context seems be that if a
>  process T1 in group G1 mmaps a big file and then another process T2 in
>  group G2, asks for memory and triggers reclaim and generates writes of
>  the file pages mapped by T1, then these writes should not be charged to
>  T2, hence blkio_cgroup pages.
>
>  But the flip side of this might be that group G2 is a low weight group
>  and probably too busy also right now, which will delay the write out
>  and possibly T2 will wait longer for memory to be allocated.
>
> - At one point of time Andrew mentioned that buffered writes are generally a
>  big problem and one needs to map these to owner's group. Though I am not
>  very sure what specific problem he was referring to. Can we attribute
>  buffered writes to pdflush threads and move all pdflush threads in a
>  cgroup to limit system wide write out activity?
>
> - Somebody also gave an example where there is a memory hogging process and
>  possibly pushes out some processes to swap. It does not sound fair to
>  charge those proccess for that swap writeout. These processes never
>  requested swap IO.
>
> - If there are multiple buffered writers in the system, then those writers
>  can also be forced to writeout some pages to disk before they are
>  allowed to dirty more pages. As per the page cache design, any writer
>  can pick any inode and start writing out pages. So it can happen a
>  weight group task is writting out pages dirtied by a lower weight group
>  task. If, async bio is mapped to owner's group, it might happen that
>  higher weight group task might be made to sleep on lower weight group
>  task because request descriptors are all consumed up.
>
> It looks like there does not seem to be a clean way which covers all the
> cases without issues. I am just trying to think, what is a simple way
> which covers most of the cases. Can we just stick to using submitting task
> context to determine a bio's group (as cfq does). Which can result in
> following.
>
> - Less code and reduced complexity.
>
> - Buffered writes will be charged to pdflush and its group. If one wish to
>  limit buffered write activity for pdflush, one can move all the pdflush
>  threads into a group and assign desired weight. Writes submitted in
>  process context will continue to be charged to that process irrespective
>  of the fact who dirtied that page.

What if we wanted to control buffered write activity per group? If a
group keeps dirtying pages, we wouldn't want it to dominate the disk
IO capacity at the expense of other cgroups (by dominating the writes
sent down by pdflush).

>
> - swap activity will be charged to kswapd and its group. If swap writes
>  are coming from process context, it gets charged to process and its
>  group.
>
> - If one is worried about the case of one process being charged for write
>  out of file mapped by another process during reclaim, then we can
>  probably make use of memory controller and mount memory controller and
>  io controller together on same hierarchy. I am told that with memory
>  controller, group's memory will be reclaimed by the process requesting
>  more memory. If that's the case, then IO will automatically be charged
>  to right group if we use submitting task context.
>
> I just wanted to bring this point forward for more discussions to know
> what is the right thing to do? Use bio tracking or not.
>
> Ryo, any thoughts on this?
>
> Thanks
> Vivek
>
>> Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to  track async bios.
  2009-08-31 18:56       ` Vivek Goyal
@ 2009-08-31 23:51         ` Nauman Rafique
  -1 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-08-31 23:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Rik van Riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

On Mon, Aug 31, 2009 at 11:56 AM, Vivek Goyal<vgoyal@redhat.com> wrote:
> On Mon, Aug 31, 2009 at 01:34:54PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o blkio_cgroup patches from Ryo to track async bios.
>>>
>>> o This functionality is used to determine the group of async IO from page
>>>   instead of context of submitting task.
>>>
>>> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
>>> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>>
>> This seems to be the most complex part of the code so far,
>> but I see why this code is necessary.
>>
>
> Hi Rik,
>
> Thanks for reviewing the patches. I wanted to have better understanding of
> where all does it help to associate a bio to the group of process who
> created/owned the page. Hence few thoughts.
>
> When a bio is submitted to IO scheduler, it needs to determine the group
> bio belongs to and group which should be charged to. There seem to be two
> methods.
>
> - Attribute the bio to cgroup submitting process belongs to.
> - For async requests, track the original owner hence cgroup of the page
>  and charge that group for the bio.
>
> One can think of pros/cons of both the approaches.
>
> - The primary use case of tracking async context seems be that if a
>  process T1 in group G1 mmaps a big file and then another process T2 in
>  group G2, asks for memory and triggers reclaim and generates writes of
>  the file pages mapped by T1, then these writes should not be charged to
>  T2, hence blkio_cgroup pages.
>
>  But the flip side of this might be that group G2 is a low weight group
>  and probably too busy also right now, which will delay the write out
>  and possibly T2 will wait longer for memory to be allocated.
>
> - At one point of time Andrew mentioned that buffered writes are generally a
>  big problem and one needs to map these to owner's group. Though I am not
>  very sure what specific problem he was referring to. Can we attribute
>  buffered writes to pdflush threads and move all pdflush threads in a
>  cgroup to limit system wide write out activity?
>
> - Somebody also gave an example where there is a memory hogging process and
>  possibly pushes out some processes to swap. It does not sound fair to
>  charge those proccess for that swap writeout. These processes never
>  requested swap IO.
>
> - If there are multiple buffered writers in the system, then those writers
>  can also be forced to writeout some pages to disk before they are
>  allowed to dirty more pages. As per the page cache design, any writer
>  can pick any inode and start writing out pages. So it can happen a
>  weight group task is writting out pages dirtied by a lower weight group
>  task. If, async bio is mapped to owner's group, it might happen that
>  higher weight group task might be made to sleep on lower weight group
>  task because request descriptors are all consumed up.
>
> It looks like there does not seem to be a clean way which covers all the
> cases without issues. I am just trying to think, what is a simple way
> which covers most of the cases. Can we just stick to using submitting task
> context to determine a bio's group (as cfq does). Which can result in
> following.
>
> - Less code and reduced complexity.
>
> - Buffered writes will be charged to pdflush and its group. If one wish to
>  limit buffered write activity for pdflush, one can move all the pdflush
>  threads into a group and assign desired weight. Writes submitted in
>  process context will continue to be charged to that process irrespective
>  of the fact who dirtied that page.

What if we wanted to control buffered write activity per group? If a
group keeps dirtying pages, we wouldn't want it to dominate the disk
IO capacity at the expense of other cgroups (by dominating the writes
sent down by pdflush).

>
> - swap activity will be charged to kswapd and its group. If swap writes
>  are coming from process context, it gets charged to process and its
>  group.
>
> - If one is worried about the case of one process being charged for write
>  out of file mapped by another process during reclaim, then we can
>  probably make use of memory controller and mount memory controller and
>  io controller together on same hierarchy. I am told that with memory
>  controller, group's memory will be reclaimed by the process requesting
>  more memory. If that's the case, then IO will automatically be charged
>  to right group if we use submitting task context.
>
> I just wanted to bring this point forward for more discussions to know
> what is the right thing to do? Use bio tracking or not.
>
> Ryo, any thoughts on this?
>
> Thanks
> Vivek
>
>> Acked-by: Rik van Riel <riel@redhat.com>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-08-31 23:51         ` Nauman Rafique
  0 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-08-31 23:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Rik van Riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

On Mon, Aug 31, 2009 at 11:56 AM, Vivek Goyal<vgoyal@redhat.com> wrote:
> On Mon, Aug 31, 2009 at 01:34:54PM -0400, Rik van Riel wrote:
>> Vivek Goyal wrote:
>>> o blkio_cgroup patches from Ryo to track async bios.
>>>
>>> o This functionality is used to determine the group of async IO from page
>>>   instead of context of submitting task.
>>>
>>> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
>>> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>>
>> This seems to be the most complex part of the code so far,
>> but I see why this code is necessary.
>>
>
> Hi Rik,
>
> Thanks for reviewing the patches. I wanted to have better understanding of
> where all does it help to associate a bio to the group of process who
> created/owned the page. Hence few thoughts.
>
> When a bio is submitted to IO scheduler, it needs to determine the group
> bio belongs to and group which should be charged to. There seem to be two
> methods.
>
> - Attribute the bio to cgroup submitting process belongs to.
> - For async requests, track the original owner hence cgroup of the page
>  and charge that group for the bio.
>
> One can think of pros/cons of both the approaches.
>
> - The primary use case of tracking async context seems be that if a
>  process T1 in group G1 mmaps a big file and then another process T2 in
>  group G2, asks for memory and triggers reclaim and generates writes of
>  the file pages mapped by T1, then these writes should not be charged to
>  T2, hence blkio_cgroup pages.
>
>  But the flip side of this might be that group G2 is a low weight group
>  and probably too busy also right now, which will delay the write out
>  and possibly T2 will wait longer for memory to be allocated.
>
> - At one point of time Andrew mentioned that buffered writes are generally a
>  big problem and one needs to map these to owner's group. Though I am not
>  very sure what specific problem he was referring to. Can we attribute
>  buffered writes to pdflush threads and move all pdflush threads in a
>  cgroup to limit system wide write out activity?
>
> - Somebody also gave an example where there is a memory hogging process and
>  possibly pushes out some processes to swap. It does not sound fair to
>  charge those proccess for that swap writeout. These processes never
>  requested swap IO.
>
> - If there are multiple buffered writers in the system, then those writers
>  can also be forced to writeout some pages to disk before they are
>  allowed to dirty more pages. As per the page cache design, any writer
>  can pick any inode and start writing out pages. So it can happen a
>  weight group task is writting out pages dirtied by a lower weight group
>  task. If, async bio is mapped to owner's group, it might happen that
>  higher weight group task might be made to sleep on lower weight group
>  task because request descriptors are all consumed up.
>
> It looks like there does not seem to be a clean way which covers all the
> cases without issues. I am just trying to think, what is a simple way
> which covers most of the cases. Can we just stick to using submitting task
> context to determine a bio's group (as cfq does). Which can result in
> following.
>
> - Less code and reduced complexity.
>
> - Buffered writes will be charged to pdflush and its group. If one wish to
>  limit buffered write activity for pdflush, one can move all the pdflush
>  threads into a group and assign desired weight. Writes submitted in
>  process context will continue to be charged to that process irrespective
>  of the fact who dirtied that page.

What if we wanted to control buffered write activity per group? If a
group keeps dirtying pages, we wouldn't want it to dominate the disk
IO capacity at the expense of other cgroups (by dominating the writes
sent down by pdflush).

>
> - swap activity will be charged to kswapd and its group. If swap writes
>  are coming from process context, it gets charged to process and its
>  group.
>
> - If one is worried about the case of one process being charged for write
>  out of file mapped by another process during reclaim, then we can
>  probably make use of memory controller and mount memory controller and
>  io controller together on same hierarchy. I am told that with memory
>  controller, group's memory will be reclaimed by the process requesting
>  more memory. If that's the case, then IO will automatically be charged
>  to right group if we use submitting task context.
>
> I just wanted to bring this point forward for more discussions to know
> what is the right thing to do? Use bio tracking or not.
>
> Ryo, any thoughts on this?
>
> Thanks
> Vivek
>
>> Acked-by: Rik van Riel <riel@redhat.com>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]         ` <e98e18940908311651s26de5b70ye6f4a82402956309-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-01  7:00           ` Ryo Tsuruta
  0 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-01  7:00 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

> > Hi Rik,
> >
> > Thanks for reviewing the patches. I wanted to have better understanding of
> > where all does it help to associate a bio to the group of process who
> > created/owned the page. Hence few thoughts.
> >
> > When a bio is submitted to IO scheduler, it needs to determine the group
> > bio belongs to and group which should be charged to. There seem to be two
> > methods.
> >
> > - Attribute the bio to cgroup submitting process belongs to.
> > - For async requests, track the original owner hence cgroup of the page
> >  and charge that group for the bio.
> >
> > One can think of pros/cons of both the approaches.
> >
> > - The primary use case of tracking async context seems be that if a
> >  process T1 in group G1 mmaps a big file and then another process T2 in
> >  group G2, asks for memory and triggers reclaim and generates writes of
> >  the file pages mapped by T1, then these writes should not be charged to
> >  T2, hence blkio_cgroup pages.
> >
> >  But the flip side of this might be that group G2 is a low weight group
> >  and probably too busy also right now, which will delay the write out
> >  and possibly T2 will wait longer for memory to be allocated.

In order to avoid this wait, dm-ioband issues IO which has a page with
PG_Reclaim as early as possible.

> > - At one point of time Andrew mentioned that buffered writes are generally a
> >  big problem and one needs to map these to owner's group. Though I am not
> >  very sure what specific problem he was referring to. Can we attribute
> >  buffered writes to pdflush threads and move all pdflush threads in a
> >  cgroup to limit system wide write out activity?

I think that buffered writes also should be controlled per cgroup as
well as synchronous writes.

> > - Somebody also gave an example where there is a memory hogging process and
> >  possibly pushes out some processes to swap. It does not sound fair to
> >  charge those proccess for that swap writeout. These processes never
> >  requested swap IO.

I think that swap writeouts should be charged to the memory hogging
process, because the process consumes more resources and it should get
a penalty.

> > - If there are multiple buffered writers in the system, then those writers
> >  can also be forced to writeout some pages to disk before they are
> >  allowed to dirty more pages. As per the page cache design, any writer
> >  can pick any inode and start writing out pages. So it can happen a
> >  weight group task is writting out pages dirtied by a lower weight group
> >  task. If, async bio is mapped to owner's group, it might happen that
> >  higher weight group task might be made to sleep on lower weight group
> >  task because request descriptors are all consumed up.

As mentioned above, in dm-ioband, the bio is charged to the page owner
and issued immediately.

> > It looks like there does not seem to be a clean way which covers all the
> > cases without issues. I am just trying to think, what is a simple way
> > which covers most of the cases. Can we just stick to using submitting task
> > context to determine a bio's group (as cfq does). Which can result in
> > following.
> >
> > - Less code and reduced complexity.
> >
> > - Buffered writes will be charged to pdflush and its group. If one wish to
> >  limit buffered write activity for pdflush, one can move all the pdflush
> >  threads into a group and assign desired weight. Writes submitted in
> >  process context will continue to be charged to that process irrespective
> >  of the fact who dirtied that page.
> 
> What if we wanted to control buffered write activity per group? If a
> group keeps dirtying pages, we wouldn't want it to dominate the disk
> IO capacity at the expense of other cgroups (by dominating the writes
> sent down by pdflush).

Yes, I think that is true.

> > - swap activity will be charged to kswapd and its group. If swap writes
> >  are coming from process context, it gets charged to process and its
> >  group.
> >
> > - If one is worried about the case of one process being charged for write
> >  out of file mapped by another process during reclaim, then we can
> >  probably make use of memory controller and mount memory controller and
> >  io controller together on same hierarchy. I am told that with memory
> >  controller, group's memory will be reclaimed by the process requesting
> >  more memory. If that's the case, then IO will automatically be charged
> >  to right group if we use submitting task context.
> >
> > I just wanted to bring this point forward for more discussions to know
> > what is the right thing to do? Use bio tracking or not.

Thanks for bringing it forward.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-08-31 23:51         ` Nauman Rafique
@ 2009-09-01  7:00           ` Ryo Tsuruta
  -1 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-01  7:00 UTC (permalink / raw)
  To: nauman
  Cc: vgoyal, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

Hi,

> > Hi Rik,
> >
> > Thanks for reviewing the patches. I wanted to have better understanding of
> > where all does it help to associate a bio to the group of process who
> > created/owned the page. Hence few thoughts.
> >
> > When a bio is submitted to IO scheduler, it needs to determine the group
> > bio belongs to and group which should be charged to. There seem to be two
> > methods.
> >
> > - Attribute the bio to cgroup submitting process belongs to.
> > - For async requests, track the original owner hence cgroup of the page
> >  and charge that group for the bio.
> >
> > One can think of pros/cons of both the approaches.
> >
> > - The primary use case of tracking async context seems be that if a
> >  process T1 in group G1 mmaps a big file and then another process T2 in
> >  group G2, asks for memory and triggers reclaim and generates writes of
> >  the file pages mapped by T1, then these writes should not be charged to
> >  T2, hence blkio_cgroup pages.
> >
> >  But the flip side of this might be that group G2 is a low weight group
> >  and probably too busy also right now, which will delay the write out
> >  and possibly T2 will wait longer for memory to be allocated.

In order to avoid this wait, dm-ioband issues IO which has a page with
PG_Reclaim as early as possible.

> > - At one point of time Andrew mentioned that buffered writes are generally a
> >  big problem and one needs to map these to owner's group. Though I am not
> >  very sure what specific problem he was referring to. Can we attribute
> >  buffered writes to pdflush threads and move all pdflush threads in a
> >  cgroup to limit system wide write out activity?

I think that buffered writes also should be controlled per cgroup as
well as synchronous writes.

> > - Somebody also gave an example where there is a memory hogging process and
> >  possibly pushes out some processes to swap. It does not sound fair to
> >  charge those proccess for that swap writeout. These processes never
> >  requested swap IO.

I think that swap writeouts should be charged to the memory hogging
process, because the process consumes more resources and it should get
a penalty.

> > - If there are multiple buffered writers in the system, then those writers
> >  can also be forced to writeout some pages to disk before they are
> >  allowed to dirty more pages. As per the page cache design, any writer
> >  can pick any inode and start writing out pages. So it can happen a
> >  weight group task is writting out pages dirtied by a lower weight group
> >  task. If, async bio is mapped to owner's group, it might happen that
> >  higher weight group task might be made to sleep on lower weight group
> >  task because request descriptors are all consumed up.

As mentioned above, in dm-ioband, the bio is charged to the page owner
and issued immediately.

> > It looks like there does not seem to be a clean way which covers all the
> > cases without issues. I am just trying to think, what is a simple way
> > which covers most of the cases. Can we just stick to using submitting task
> > context to determine a bio's group (as cfq does). Which can result in
> > following.
> >
> > - Less code and reduced complexity.
> >
> > - Buffered writes will be charged to pdflush and its group. If one wish to
> >  limit buffered write activity for pdflush, one can move all the pdflush
> >  threads into a group and assign desired weight. Writes submitted in
> >  process context will continue to be charged to that process irrespective
> >  of the fact who dirtied that page.
> 
> What if we wanted to control buffered write activity per group? If a
> group keeps dirtying pages, we wouldn't want it to dominate the disk
> IO capacity at the expense of other cgroups (by dominating the writes
> sent down by pdflush).

Yes, I think that is true.

> > - swap activity will be charged to kswapd and its group. If swap writes
> >  are coming from process context, it gets charged to process and its
> >  group.
> >
> > - If one is worried about the case of one process being charged for write
> >  out of file mapped by another process during reclaim, then we can
> >  probably make use of memory controller and mount memory controller and
> >  io controller together on same hierarchy. I am told that with memory
> >  controller, group's memory will be reclaimed by the process requesting
> >  more memory. If that's the case, then IO will automatically be charged
> >  to right group if we use submitting task context.
> >
> > I just wanted to bring this point forward for more discussions to know
> > what is the right thing to do? Use bio tracking or not.

Thanks for bringing it forward.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-09-01  7:00           ` Ryo Tsuruta
  0 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-01  7:00 UTC (permalink / raw)
  To: nauman
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	mingo, vgoyal, m-ikeda, riel, lizf, fchecconi, s-uchida,
	containers, linux-kernel, akpm, righi.andrea, torvalds

Hi,

> > Hi Rik,
> >
> > Thanks for reviewing the patches. I wanted to have better understanding of
> > where all does it help to associate a bio to the group of process who
> > created/owned the page. Hence few thoughts.
> >
> > When a bio is submitted to IO scheduler, it needs to determine the group
> > bio belongs to and group which should be charged to. There seem to be two
> > methods.
> >
> > - Attribute the bio to cgroup submitting process belongs to.
> > - For async requests, track the original owner hence cgroup of the page
> >  and charge that group for the bio.
> >
> > One can think of pros/cons of both the approaches.
> >
> > - The primary use case of tracking async context seems be that if a
> >  process T1 in group G1 mmaps a big file and then another process T2 in
> >  group G2, asks for memory and triggers reclaim and generates writes of
> >  the file pages mapped by T1, then these writes should not be charged to
> >  T2, hence blkio_cgroup pages.
> >
> >  But the flip side of this might be that group G2 is a low weight group
> >  and probably too busy also right now, which will delay the write out
> >  and possibly T2 will wait longer for memory to be allocated.

In order to avoid this wait, dm-ioband issues IO which has a page with
PG_Reclaim as early as possible.

> > - At one point of time Andrew mentioned that buffered writes are generally a
> >  big problem and one needs to map these to owner's group. Though I am not
> >  very sure what specific problem he was referring to. Can we attribute
> >  buffered writes to pdflush threads and move all pdflush threads in a
> >  cgroup to limit system wide write out activity?

I think that buffered writes also should be controlled per cgroup as
well as synchronous writes.

> > - Somebody also gave an example where there is a memory hogging process and
> >  possibly pushes out some processes to swap. It does not sound fair to
> >  charge those proccess for that swap writeout. These processes never
> >  requested swap IO.

I think that swap writeouts should be charged to the memory hogging
process, because the process consumes more resources and it should get
a penalty.

> > - If there are multiple buffered writers in the system, then those writers
> >  can also be forced to writeout some pages to disk before they are
> >  allowed to dirty more pages. As per the page cache design, any writer
> >  can pick any inode and start writing out pages. So it can happen a
> >  weight group task is writting out pages dirtied by a lower weight group
> >  task. If, async bio is mapped to owner's group, it might happen that
> >  higher weight group task might be made to sleep on lower weight group
> >  task because request descriptors are all consumed up.

As mentioned above, in dm-ioband, the bio is charged to the page owner
and issued immediately.

> > It looks like there does not seem to be a clean way which covers all the
> > cases without issues. I am just trying to think, what is a simple way
> > which covers most of the cases. Can we just stick to using submitting task
> > context to determine a bio's group (as cfq does). Which can result in
> > following.
> >
> > - Less code and reduced complexity.
> >
> > - Buffered writes will be charged to pdflush and its group. If one wish to
> >  limit buffered write activity for pdflush, one can move all the pdflush
> >  threads into a group and assign desired weight. Writes submitted in
> >  process context will continue to be charged to that process irrespective
> >  of the fact who dirtied that page.
> 
> What if we wanted to control buffered write activity per group? If a
> group keeps dirtying pages, we wouldn't want it to dominate the disk
> IO capacity at the expense of other cgroups (by dominating the writes
> sent down by pdflush).

Yes, I think that is true.

> > - swap activity will be charged to kswapd and its group. If swap writes
> >  are coming from process context, it gets charged to process and its
> >  group.
> >
> > - If one is worried about the case of one process being charged for write
> >  out of file mapped by another process during reclaim, then we can
> >  probably make use of memory controller and mount memory controller and
> >  io controller together on same hierarchy. I am told that with memory
> >  controller, group's memory will be reclaimed by the process requesting
> >  more memory. If that's the case, then IO will automatically be charged
> >  to right group if we use submitting task context.
> >
> > I just wanted to bring this point forward for more discussions to know
> > what is the right thing to do? Use bio tracking or not.

Thanks for bringing it forward.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]           ` <20090901.160004.226800357.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-01 14:11             ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-01 14:11 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Sep 01, 2009 at 04:00:04PM +0900, Ryo Tsuruta wrote:
> Hi,
> 
> > > Hi Rik,
> > >
> > > Thanks for reviewing the patches. I wanted to have better understanding of
> > > where all does it help to associate a bio to the group of process who
> > > created/owned the page. Hence few thoughts.
> > >
> > > When a bio is submitted to IO scheduler, it needs to determine the group
> > > bio belongs to and group which should be charged to. There seem to be two
> > > methods.
> > >
> > > - Attribute the bio to cgroup submitting process belongs to.
> > > - For async requests, track the original owner hence cgroup of the page
> > >  and charge that group for the bio.
> > >
> > > One can think of pros/cons of both the approaches.
> > >
> > > - The primary use case of tracking async context seems be that if a
> > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > >  group G2, asks for memory and triggers reclaim and generates writes of
> > >  the file pages mapped by T1, then these writes should not be charged to
> > >  T2, hence blkio_cgroup pages.
> > >
> > >  But the flip side of this might be that group G2 is a low weight group
> > >  and probably too busy also right now, which will delay the write out
> > >  and possibly T2 will wait longer for memory to be allocated.
> 
> In order to avoid this wait, dm-ioband issues IO which has a page with
> PG_Reclaim as early as possible.
> 

So in the above case the IO is still charged to G2, but you keep track of
whether the page is marked PG_Reclaim, and if so you release this bio ahead
of the other bios queued up in the group?
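Something along the following lines? (An illustrative fragment only, not
dm-ioband's actual code; ioband_add_bio() and ioband_queue_bio() below are
hypothetical names standing in for the entry point and the normal per-group
queueing path.)

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/page-flags.h>

extern void ioband_queue_bio(struct bio *bio);	/* hypothetical: normal per-group queueing */

static void ioband_add_bio(struct bio *bio)
{
	/* pages under reclaim bypass the per-group queue and are issued immediately */
	if (bio_page(bio) && PageReclaim(bio_page(bio)))
		generic_make_request(bio);
	else
		ioband_queue_bio(bio);
}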

> > > - At one point of time Andrew mentioned that buffered writes are generally a
> > >  big problem and one needs to map these to owner's group. Though I am not
> > >  very sure what specific problem he was referring to. Can we attribute
> > >  buffered writes to pdflush threads and move all pdflush threads in a
> > >  cgroup to limit system wide write out activity?
> 
> I think that buffered writes also should be controlled per cgroup as
> well as synchronous writes.
> 

But it is hard to achieve fairness for buffered writes because we don't
create completely parallel IO paths, and a higher weight process does not
necessarily dispatch more buffered writes to the IO scheduler (due to the
page cache buffered write logic).

So in some cases we might see buffered write fairness and in other cases
not. For example, run two dd processes doing buffered writes in two groups
and it is hard to achieve fairness between them (see the sketch below).

That's why the idea: if we can't ensure buffered-write vs buffered-write
fairness in all the cases, does it make sense to attribute buffered writes
to pdflush and put the pdflush threads into a separate group to limit
system-wide writeout activity?
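
For reference, a rough sketch of that kind of test: two buffered writers,
one per group, so one can watch how the writeback ends up being split. The
cgroup directories and target files below are assumptions (the groups and
their weights would be configured beforehand); this is just a stand-in for
running two dd processes in two groups.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

#define MB (1024 * 1024)

static void join_cgroup(const char *cgdir)
{
	char path[512], pid[32];
	int fd;

	snprintf(path, sizeof(path), "%s/tasks", cgdir);
	snprintf(pid, sizeof(pid), "%d\n", getpid());
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, pid, strlen(pid)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

static void buffered_writer(const char *cgdir, const char *file, int mbytes)
{
	static char buf[MB];
	int fd, i;

	join_cgroup(cgdir);
	fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror(file);
		exit(1);
	}
	/* buffered writes, like dd without O_DIRECT */
	for (i = 0; i < mbytes; i++)
		if (write(fd, buf, MB) != MB) {
			perror("write");
			exit(1);
		}
	close(fd);
	exit(0);
}

int main(void)
{
	/* assumed cgroup directories with different weights already set up */
	if (fork() == 0)
		buffered_writer("/cgroup/io/high", "/mnt/test/file1", 512);
	if (fork() == 0)
		buffered_writer("/cgroup/io/low", "/mnt/test/file2", 512);
	wait(NULL);
	wait(NULL);
	return 0;
}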

> > > - Somebody also gave an example where there is a memory hogging process and
> > >  possibly pushes out some processes to swap. It does not sound fair to
> > >  charge those proccess for that swap writeout. These processes never
> > >  requested swap IO.
> 
> I think that swap writeouts should be charged to the memory hogging
> process, because the process consumes more resources and it should get
> a penalty.
> 

A process requesting memory gets an IO penalty? IMHO, swapping is a kernel
mechanism, the kernel's way of providing extended RAM. If we want to solve
the issue of a process hogging memory, then the right way to solve it is to
use the memory controller, not to charge the process for IO activity.
Instead, probably a more suitable way is to charge swap activity to the root
group (where by default all kernel related activity goes).

> > > - If there are multiple buffered writers in the system, then those writers
> > >  can also be forced to writeout some pages to disk before they are
> > >  allowed to dirty more pages. As per the page cache design, any writer
> > >  can pick any inode and start writing out pages. So it can happen a
> > >  weight group task is writting out pages dirtied by a lower weight group
> > >  task. If, async bio is mapped to owner's group, it might happen that
> > >  higher weight group task might be made to sleep on lower weight group
> > >  task because request descriptors are all consumed up.
> 
> As mentioned above, in dm-ioband, the bio is charged to the page owner
> and issued immediately.

But you are doing it only for selected pages and not for all buffered
writes?

> 
> > > It looks like there does not seem to be a clean way which covers all the
> > > cases without issues. I am just trying to think, what is a simple way
> > > which covers most of the cases. Can we just stick to using submitting task
> > > context to determine a bio's group (as cfq does). Which can result in
> > > following.
> > >
> > > - Less code and reduced complexity.
> > >
> > > - Buffered writes will be charged to pdflush and its group. If one wish to
> > >  limit buffered write activity for pdflush, one can move all the pdflush
> > >  threads into a group and assign desired weight. Writes submitted in
> > >  process context will continue to be charged to that process irrespective
> > >  of the fact who dirtied that page.
> > 
> > What if we wanted to control buffered write activity per group? If a
> > group keeps dirtying pages, we wouldn't want it to dominate the disk
> > IO capacity at the expense of other cgroups (by dominating the writes
> > sent down by pdflush).
> 
> Yes, I think that is true.
> 

But anyway we are not able to guarantee this isolation in all the cases.
Again I go back to the example of two dd threads doing buffered writes in
two groups.

I don't mind keeping it. Just wanted to make sure that we agree and
understand that keeping it does not mean that we get buffered write vs
buffered write isolation/fairness in all the cases.

> > > - swap activity will be charged to kswapd and its group. If swap writes
> > >  are coming from process context, it gets charged to process and its
> > >  group.
> > >
> > > - If one is worried about the case of one process being charged for write
> > >  out of file mapped by another process during reclaim, then we can
> > >  probably make use of memory controller and mount memory controller and
> > >  io controller together on same hierarchy. I am told that with memory
> > >  controller, group's memory will be reclaimed by the process requesting
> > >  more memory. If that's the case, then IO will automatically be charged
> > >  to right group if we use submitting task context.
> > >
> > > I just wanted to bring this point forward for more discussions to know
> > > what is the right thing to do? Use bio tracking or not.
> 
> Thanks for bringing it forward.
> 
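
On the co-mounting of the memory and io controllers mentioned above, a
minimal sketch, assuming controller names "io" and "memory" (the actual
names depend on the patches):

  mkdir -p /cgroup
  mount -t cgroup -o io,memory none /cgroup
  mkdir /cgroup/G1
  # tasks put into G1 are then grouped identically for memory reclaim
  # and for IO scheduling
  echo $$ > /cgroup/G1/tasks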

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-01 14:11             ` Vivek Goyal
@ 2009-09-01 14:53               ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-09-01 14:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ryo Tsuruta, nauman, linux-kernel, jens.axboe, containers,
	dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente,
	fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

Vivek Goyal wrote:
> On Tue, Sep 01, 2009 at 04:00:04PM +0900, Ryo Tsuruta wrote:

>> I think that swap writeouts should be charged to the memory hogging
>> process, because the process consumes more resources and it should get
>> a penalty.
> 
> A process requesting memory gets IO penalty? 

There is no easy answer here.

On the one hand, you want to charge the process that uses
the resources.

On the other hand, if a lower resource use / higher priority
process tries to free up some of those resources, it should
not have its IO requests penalized (and get slowed down)
because of something the first process did...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to  track async bios.
  2009-09-01 14:11             ` Vivek Goyal
@ 2009-09-01 18:02               ` Nauman Rafique
  -1 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-09-01 18:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ryo Tsuruta, riel, linux-kernel, jens.axboe, containers,
	dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente,
	fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

On Tue, Sep 1, 2009 at 7:11 AM, Vivek Goyal<vgoyal@redhat.com> wrote:
> On Tue, Sep 01, 2009 at 04:00:04PM +0900, Ryo Tsuruta wrote:
>> Hi,
>>
>> > > Hi Rik,
>> > >
>> > > Thanks for reviewing the patches. I wanted to have better understanding of
>> > > where all does it help to associate a bio to the group of process who
>> > > created/owned the page. Hence few thoughts.
>> > >
>> > > When a bio is submitted to IO scheduler, it needs to determine the group
>> > > bio belongs to and group which should be charged to. There seem to be two
>> > > methods.
>> > >
>> > > - Attribute the bio to cgroup submitting process belongs to.
>> > > - For async requests, track the original owner hence cgroup of the page
>> > >  and charge that group for the bio.
>> > >
>> > > One can think of pros/cons of both the approaches.
>> > >
>> > > - The primary use case of tracking async context seems be that if a
>> > >  process T1 in group G1 mmaps a big file and then another process T2 in
>> > >  group G2, asks for memory and triggers reclaim and generates writes of
>> > >  the file pages mapped by T1, then these writes should not be charged to
>> > >  T2, hence blkio_cgroup pages.
>> > >
>> > >  But the flip side of this might be that group G2 is a low weight group
>> > >  and probably too busy also right now, which will delay the write out
>> > >  and possibly T2 will wait longer for memory to be allocated.
>>
>> In order to avoid this wait, dm-ioband issues IO which has a page with
>> PG_Reclaim as early as possible.
>>
>
> So in above case IO is still charged to G2 but you keep a track if page is
> PG_Reclaim then releae the this bio before other bios queued up in the
> group?
>
>> > > - At one point of time Andrew mentioned that buffered writes are generally a
>> > >  big problem and one needs to map these to owner's group. Though I am not
>> > >  very sure what specific problem he was referring to. Can we attribute
>> > >  buffered writes to pdflush threads and move all pdflush threads in a
>> > >  cgroup to limit system wide write out activity?
>>
>> I think that buffered writes also should be controlled per cgroup as
>> well as synchronous writes.
>>
>
> But it is hard to achieve fairness for buffered writes becase we don't
> create complete parallel IO paths and not necessarily higher weight
> process dispatches more buffered writes to IO scheduler. (Due to page
> cache buffered write logic).
>
> So in some cases we might see buffered write fairness and in other cases
> not. For example, run two dd processes in two groups doing buffered writes
> and it is hard to achieve fairness between these.

If something is broken, we don't necessarily have to break it further.
Instead, we should be thinking about why it's hard to achieve fairness
with buffered writeback. Is there a way to change the writeback path
to send down a constant stream of IO, instead of sending down bursts?

>
> That's why the idea that if we can't ensure Buffered write vs Buffered
> write fairness in all the cases, then does it make sense to attribute
> buffered writes to pdflush and put pdflush threads into a separate group
> to limit system wide write out activity.
>
>> > > - Somebody also gave an example where there is a memory hogging process and
>> > >  possibly pushes out some processes to swap. It does not sound fair to
>> > >  charge those proccess for that swap writeout. These processes never
>> > >  requested swap IO.
>>
>> I think that swap writeouts should be charged to the memory hogging
>> process, because the process consumes more resources and it should get
>> a penalty.
>>
>
> A process requesting memory gets IO penalty? IMHO, swapping is a kernel
> mechanism and kernel's way of providing extended RAM. If we want to solve
> the issue of memory hogging by a process then right way to solve is to use
> memory controller and not by charging the process for IO activity.
> Instead, proabably a more suitable way is to charge swap activity to root
> group (where by default all the kernel related activity goes).
>
>> > > - If there are multiple buffered writers in the system, then those writers
>> > >  can also be forced to writeout some pages to disk before they are
>> > >  allowed to dirty more pages. As per the page cache design, any writer
>> > >  can pick any inode and start writing out pages. So it can happen a
>> > >  weight group task is writting out pages dirtied by a lower weight group
>> > >  task. If, async bio is mapped to owner's group, it might happen that
>> > >  higher weight group task might be made to sleep on lower weight group
>> > >  task because request descriptors are all consumed up.
>>
>> As mentioned above, in dm-ioband, the bio is charged to the page owner
>> and issued immediately.
>
> But you are doing it only for selected pages and not for all buffered
> writes?
>
>>
>> > > It looks like there does not seem to be a clean way which covers all the
>> > > cases without issues. I am just trying to think, what is a simple way
>> > > which covers most of the cases. Can we just stick to using submitting task
>> > > context to determine a bio's group (as cfq does). Which can result in
>> > > following.
>> > >
>> > > - Less code and reduced complexity.
>> > >
>> > > - Buffered writes will be charged to pdflush and its group. If one wish to
>> > > álimit buffered write activity for pdflush, one can move all the pdflush
>> > > áthreads into a group and assign desired weight. Writes submitted in
>> > > áprocess context will continue to be charged to that process irrespective
>> > > áof the fact who dirtied that page.
>> >
>> > What if we wanted to control buffered write activity per group? If a
>> > group keeps dirtying pages, we wouldn't want it to dominate the disk
>> > IO capacity at the expense of other cgroups (by dominating the writes
>> > sent down by pdflush).
>>
>> Yes, I think that is true.
>>
>
> But anyway we are not able to gurantee this isolation in all the cases.
> Again I go back to example of two dd threads doing buffered writes in two
> groups.
>
> I don't mind keeping it. Just wanted to make sure that we agree and
> understand that keeping it does not mean that we get buffered write vs
> buffered write isolation/fairness in all the cases.
>
>> > > - swap activity will be charged to kswapd and its group. If swap writes
>> > > áare coming from process context, it gets charged to process and its
>> > > ágroup.
>> > >
>> > > - If one is worried about the case of one process being charged for write
>> > >  out of file mapped by another process during reclaim, then we can
>> > >  probably make use of memory controller and mount memory controller and
>> > >  io controller together on same hierarchy. I am told that with memory
>> > >  controller, group's memory will be reclaimed by the process requesting
>> > >  more memory. If that's the case, then IO will automatically be charged
>> > >  to right group if we use submitting task context.
>> > >
>> > > I just wanted to bring this point forward for more discussions to know
>> > > what is the right thing to do? Use bio tracking or not.
>>
>> Thanks for bringing it forward.
>>
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-09-02  0:58   ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-02  0:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 
> Changes from V8
> ===============
> - Implemented bdi like congestion semantics for io group also. Now once an
>   io group gets congested, we don't clear the congestion flag until number
>   of requests goes below nr_congestion_off.
> 
>   This helps in getting rid of Buffered write performance regression we
>   were observing with io controller patches.
> 
>   Gui, can you please test it and see if this version is better in terms
>   of your buffered write tests.

Hi Vivek,

Here are some performance numbers generated by an fio test. It seems V9 performance is
better than V8, especially for the write cases.

                         Normal Read  |  Random Read  |  Normal Write  |  Random Write

V8 (avg of 3 runs)       64667 KiB/s     3387 KiB/s      59197 KiB/s      9327 KiB/s 

V9 (avg of 3 runs)       65947 KiB/s     3528 KiB/s      61654 KiB/s      9744 KiB/s

Performance              +2.0%           +4.1%           +4.2%            +4.5%
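
The exact fio job options behind these numbers are not shown here; purely
as an illustration, invocations covering the four cases could look like:

  fio --name=seqread   --rw=read      --bs=64k --size=1g --directory=/mnt/test
  fio --name=randread  --rw=randread  --bs=4k  --size=1g --directory=/mnt/test
  fio --name=seqwrite  --rw=write     --bs=64k --size=1g --directory=/mnt/test
  fio --name=randwrite --rw=randwrite --bs=4k  --size=1g --directory=/mnt/test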




^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-01 14:11             ` Vivek Goyal
@ 2009-09-02  0:59               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 322+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-02  0:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ryo Tsuruta, dhaval, dm-devel, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, jmoyer, mingo, riel,
	fchecconi, containers, linux-kernel, akpm, righi.andrea,
	torvalds

On Tue, 1 Sep 2009 10:11:42 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > - Somebody also gave an example where there is a memory hogging process and
> > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > >  charge those proccess for that swap writeout. These processes never
> > > >  requested swap IO.
> > 
> > I think that swap writeouts should be charged to the memory hogging
> > process, because the process consumes more resources and it should get
> > a penalty.
> > 
> 
> A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> mechanism and kernel's way of providing extended RAM. If we want to solve
> the issue of memory hogging by a process then right way to solve is to use
> memory controller and not by charging the process for IO activity.
> Instead, proabably a more suitable way is to charge swap activity to root
> group (where by default all the kernel related activity goes).   
> 

I agree. It's memcg's job.
(Supporting dirty_ratio in memcg is necessary, I think.)

Background write-out to swap due to memory shortage should be handled
as kernel I/O. If swap-out by memcg because of its limit is a problem,
dirty_ratio for memcg should be implemented.
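
For example (the global knob below exists today; the per-cgroup file is
purely hypothetical and is only meant to show the kind of knob needed):

  # global, exists today
  echo 10 > /proc/sys/vm/dirty_ratio

  # hypothetical per-memcg equivalent
  echo 10 > /cgroup/memory/G1/memory.dirty_ratio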

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-09-02  0:59               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 322+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-02  0:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente, jmarchan, dhaval, riel, fernando, jmoyer, akpm,
	linux-kernel, fchecconi, dm-devel, jens.axboe, mingo,
	righi.andrea, torvalds, containers, agk, balbir

On Tue, 1 Sep 2009 10:11:42 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > - Somebody also gave an example where there is a memory hogging process and
> > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > >  charge those processes for that swap writeout. These processes never
> > > >  requested swap IO.
> > 
> > I think that swap writeouts should be charged to the memory hogging
> > process, because the process consumes more resources and it should get
> > a penalty.
> > 
> 
> A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> mechanism and kernel's way of providing extended RAM. If we want to solve
> the issue of memory hogging by a process, then the right way to solve it is to use
> the memory controller and not to charge the process for IO activity.
> Instead, probably a more suitable way is to charge swap activity to the root
> group (where by default all the kernel related activity goes).   
> 

I agree. It's memcg's job.
(Supporting dirty_ratio in memcg is necessary, I think.)

Background write-out to swap due to memory shortage should be handled
as kernel I/O. If swap-out caused by a memcg hitting its limit is a problem,
dirty_ratio for memcg should be implemented.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]               ` <20090902095912.cdf8a55e.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2009-09-02  3:12                 ` Balbir Singh
  0 siblings, 0 replies; 322+ messages in thread
From: Balbir Singh @ 2009-09-02  3:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	agk-H+wXaHxf7aLQT0dZR+AlfA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, Sep 2, 2009 at 6:29 AM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> On Tue, 1 Sep 2009 10:11:42 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > > > - Somebody also gave an example where there is a memory hogging process and
>> > > >  possibly pushes out some processes to swap. It does not sound fair to
>> > > >  charge those processes for that swap writeout. These processes never
>> > > >  requested swap IO.
>> >
>> > I think that swap writeouts should be charged to the memory hogging
>> > process, because the process consumes more resources and it should get
>> > a penalty.
>> >
>>
>> A process requesting memory gets IO penalty? IMHO, swapping is a kernel
>> mechanism and kernel's way of providing extended RAM. If we want to solve
>> the issue of memory hogging by a process, then the right way to solve it is to use
>> the memory controller and not to charge the process for IO activity.
>> Instead, probably a more suitable way is to charge swap activity to the root
>> group (where by default all the kernel related activity goes).
>>
>
> I agree. It's memcg's job.
> (Supporting dirty_ratio in memcg is necessary, I think.)
>
> Background write-out to swap due to memory shortage should be handled
> as kernel I/O. If swap-out caused by a memcg hitting its limit is a problem,
> dirty_ratio for memcg should be implemented.

I tend to agree, looks like dirty_ratio will become important along
with overcommit support in the future.

Balbir Singh.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to  track async bios.
  2009-09-02  0:59               ` KAMEZAWA Hiroyuki
@ 2009-09-02  3:12                 ` Balbir Singh
  -1 siblings, 0 replies; 322+ messages in thread
From: Balbir Singh @ 2009-09-02  3:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Ryo Tsuruta, dhaval, dm-devel, jens.axboe, agk,
	paolo.valente, jmarchan, fernando, jmoyer, mingo, riel,
	fchecconi, containers, linux-kernel, akpm, righi.andrea,
	torvalds

On Wed, Sep 2, 2009 at 6:29 AM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 1 Sep 2009 10:11:42 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>> > > > - Somebody also gave an example where there is a memory hogging process and
>> > > >  possibly pushes out some processes to swap. It does not sound fair to
>> > > >  charge those processes for that swap writeout. These processes never
>> > > >  requested swap IO.
>> >
>> > I think that swap writeouts should be charged to the memory hogging
>> > process, because the process consumes more resources and it should get
>> > a penalty.
>> >
>>
>> A process requesting memory gets IO penalty? IMHO, swapping is a kernel
>> mechanism and kernel's way of providing extended RAM. If we want to solve
>> the issue of memory hogging by a process, then the right way to solve it is to use
>> the memory controller and not to charge the process for IO activity.
>> Instead, probably a more suitable way is to charge swap activity to the root
>> group (where by default all the kernel related activity goes).
>>
>
> I agree. It's memcg's job.
> (Supporting dirty_ratio in memcg is necessary, I think.)
>
> Background write-out to swap due to memory shortage should be handled
> as kernel I/O. If swap-out caused by a memcg hitting its limit is a problem,
> dirty_ratio for memcg should be implemented.

I tend to agree, looks like dirty_ratio will become important along
with overcommit support in the future.

Balbir Singh.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-09-02  3:12                 ` Balbir Singh
  0 siblings, 0 replies; 322+ messages in thread
From: Balbir Singh @ 2009-09-02  3:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Ryo Tsuruta, dhaval, dm-devel, jens.axboe, agk,
	paolo.valente, jmarchan, fernando, jmoyer, mingo, riel,
	fchecconi, containers, linux-kernel, akpm, righi.andrea,
	torvalds

On Wed, Sep 2, 2009 at 6:29 AM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 1 Sep 2009 10:11:42 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>> > > > - Somebody also gave an example where there is a memory hogging process and
>> > > >  possibly pushes out some processes to swap. It does not sound fair to
>> > > >  charge those processes for that swap writeout. These processes never
>> > > >  requested swap IO.
>> >
>> > I think that swap writeouts should be charged to the memory hogging
>> > process, because the process consumes more resources and it should get
>> > a penalty.
>> >
>>
>> A process requesting memory gets IO penalty? IMHO, swapping is a kernel
>> mechanism and kernel's way of providing extended RAM. If we want to solve
>> the issue of memory hogging by a process, then the right way to solve it is to use
>> the memory controller and not to charge the process for IO activity.
>> Instead, probably a more suitable way is to charge swap activity to the root
>> group (where by default all the kernel related activity goes).
>>
>
> I agree. It's memcg's job.
> (Supporting dirty_ratio in memcg is necessary, I think.)
>
> Background write-out to swap due to memory shortage should be handled
> as kernel I/O. If swap-out caused by a memcg hitting its limit is a problem,
> dirty_ratio for memcg should be implemented.

I tend to agree, looks like dirty_ratio will become important along
with overcommit support in the future.

Balbir Singh.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]             ` <20090901141142.GA13709-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                                 ` (2 preceding siblings ...)
  2009-09-02  0:59               ` KAMEZAWA Hiroyuki
@ 2009-09-02  9:52               ` Ryo Tsuruta
  3 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-02  9:52 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

> > > > - The primary use case of tracking async context seems to be that if a
> > > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > > >  group G2, asks for memory and triggers reclaim and generates writes of
> > > >  the file pages mapped by T1, then these writes should not be charged to
> > > >  T2, hence blkio_cgroup pages.
> > > >
> > > >  But the flip side of this might be that group G2 is a low weight group
> > > >  and probably too busy also right now, which will delay the write out
> > > >  and possibly T2 will wait longer for memory to be allocated.
> > 
> > In order to avoid this wait, dm-ioband issues IO which has a page with
> > PG_Reclaim as early as possible.
> > 
> 
> So in the above case IO is still charged to G2, but you keep track of whether the page is
> PG_Reclaim and then release this bio before other bios queued up in the
> group?

Yes, the bio with PG_Reclaim page is given priority over the other bios.
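
A minimal sketch of that kind of prioritization over a simple per-group FIFO
(the types and the reclaim flag are stand-ins for this example, not
dm-ioband's actual code):

#include <stdbool.h>
#include <stddef.h>

struct bio_node {
	struct bio_node *next;
	bool reclaim;                   /* a page in this bio has PG_Reclaim set */
};

struct ioband_group {
	struct bio_node *head, *tail;
};

static void group_queue_bio(struct ioband_group *g, struct bio_node *bio)
{
	if (bio->reclaim) {             /* dispatch ahead of already queued bios */
		bio->next = g->head;
		g->head = bio;
		if (!g->tail)
			g->tail = bio;
	} else {                        /* normal bios wait for their turn */
		bio->next = NULL;
		if (g->tail)
			g->tail->next = bio;
		else
			g->head = bio;
		g->tail = bio;
	}
}

Queueing at the head keeps the reclaim write-out from waiting behind a
possibly long backlog of that group's ordinary bios.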

> > > > - At one point of time Andrew mentioned that buffered writes are generally a
> > > >  big problem and one needs to map these to owner's group. Though I am not
> > > >  very sure what specific problem he was referring to. Can we attribute
> > > >  buffered writes to pdflush threads and move all pdflush threads in a
> > > >  cgroup to limit system wide write out activity?
> > 
> > I think that buffered writes also should be controlled per cgroup as
> > well as synchronous writes.
> > 
> 
> But it is hard to achieve fairness for buffered writes because we don't
> create complete parallel IO paths and not necessarily higher weight
> process dispatches more buffered writes to IO scheduler. (Due to page
> cache buffered write logic).
> 
> So in some cases we might see buffered write fairness and in other cases
> not. For example, run two dd processes in two groups doing buffered writes
> and it is hard to achieve fairness between these.
> 
> That's why the idea that if we can't ensure Buffered write vs Buffered
> write fairness in all the cases, then does it make sense to attribute
> buffered writes to pdflush and put pdflush threads into a separate group
> to limit system wide write out activity. 

If all buffered writes are treated as system wide activities, it does
not mean that bandwidth is being controlled. It is true that pdflush
doesn't do I/O according to weight, but bandwidth (including for
buffered writes) should be reserved for each cgroup.

> > > > - Somebody also gave an example where there is a memory hogging process and
> > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > >  charge those processes for that swap writeout. These processes never
> > > >  requested swap IO.
> > 
> > I think that swap writeouts should be charged to the memory hogging
> > process, because the process consumes more resources and it should get
> > a penalty.
> > 
> 
> A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> mechanism and kernel's way of providing extended RAM. If we want to solve
> the issue of memory hogging by a process, then the right way to solve it is to use
> the memory controller and not to charge the process for IO activity.
> Instead, probably a more suitable way is to charge swap activity to the root
> group (where by default all the kernel related activity goes).   

No. In the current blkio-cgroup, a process which uses a large amount
of memory gets penalty, not a memory requester.

As you wrote, using both io-controller and memory controller are
required to prevent swap-out caused by memory consumption on another
cgroup.

> > > > - If there are multiple buffered writers in the system, then those writers
> > > >  can also be forced to writeout some pages to disk before they are
> > > >  allowed to dirty more pages. As per the page cache design, any writer
> > > >  can pick any inode and start writing out pages. So it can happen a
> >  weight group task is writing out pages dirtied by a lower weight group
> > > >  task. If, async bio is mapped to owner's group, it might happen that
> > > >  higher weight group task might be made to sleep on lower weight group
> > > >  task because request descriptors are all consumed up.
> > 
> > As mentioned above, in dm-ioband, the bio is charged to the page owner
> > and issued immediately.
> 
> But you are doing it only for selected pages and not for all buffered
> writes?

I'm sorry, I wrote it incorrectly in the previous mail: IO for writing out
page-cache pages is not issued immediately; it is throttled by
dm-ioband.

Anyway, there is a case where a higher weight group task is made
to sleep, but if we reserve the memory for each cgroup by memory
controller in advance, we can avoid putting the task to sleep.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-01 14:11             ` Vivek Goyal
                               ` (4 preceding siblings ...)
  (?)
@ 2009-09-02  9:52             ` Ryo Tsuruta
       [not found]               ` <20090902.185251.193693849.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-02 13:58                 ` Vivek Goyal
  -1 siblings, 2 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-02  9:52 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

Hi Vivek,

> > > > - The primary use case of tracking async context seems to be that if a
> > > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > > >  group G2, asks for memory and triggers reclaim and generates writes of
> > > >  the file pages mapped by T1, then these writes should not be charged to
> > > >  T2, hence blkio_cgroup pages.
> > > >
> > > >  But the flip side of this might be that group G2 is a low weight group
> > > >  and probably too busy also right now, which will delay the write out
> > > >  and possibly T2 will wait longer for memory to be allocated.
> > 
> > In order to avoid this wait, dm-ioband issues IO which has a page with
> > PG_Reclaim as early as possible.
> > 
> 
> So in the above case IO is still charged to G2, but you keep track of whether the page is
> PG_Reclaim and then release this bio before other bios queued up in the
> group?

Yes, the bio with PG_Reclaim page is given priority over the other bios.

> > > > - At one point of time Andrew mentioned that buffered writes are generally a
> > > >  big problem and one needs to map these to owner's group. Though I am not
> > > >  very sure what specific problem he was referring to. Can we attribute
> > > >  buffered writes to pdflush threads and move all pdflush threads in a
> > > >  cgroup to limit system wide write out activity?
> > 
> > I think that buffered writes also should be controlled per cgroup as
> > well as synchronous writes.
> > 
> 
> But it is hard to achieve fairness for buffered writes because we don't
> create complete parallel IO paths and not necessarily higher weight
> process dispatches more buffered writes to IO scheduler. (Due to page
> cache buffered write logic).
> 
> So in some cases we might see buffered write fairness and in other cases
> not. For example, run two dd processes in two groups doing buffered writes
> and it is hard to achieve fairness between these.
> 
> That's why the idea that if we can't ensure Buffered write vs Buffered
> write fairness in all the cases, then does it make sense to attribute
> buffered writes to pdflush and put pdflush threads into a separate group
> to limit system wide write out activity. 

If all buffered writes are treated as system wide activities, it does
not mean that bandwidth is being controlled. It is true that pdflush
doesn't do I/O according to weight, but bandwidth (including for
buffered writes) should be reserved for each cgroup.

> > > > - Somebody also gave an example where there is a memory hogging process and
> > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > >  charge those processes for that swap writeout. These processes never
> > > >  requested swap IO.
> > 
> > I think that swap writeouts should be charged to the memory hogging
> > process, because the process consumes more resources and it should get
> > a penalty.
> > 
> 
> A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> mechanism and kernel's way of providing extended RAM. If we want to solve
> the issue of memory hogging by a process, then the right way to solve it is to use
> the memory controller and not to charge the process for IO activity.
> Instead, probably a more suitable way is to charge swap activity to the root
> group (where by default all the kernel related activity goes).   

No. In the current blkio-cgroup, a process which uses a large amount
of memory gets penalty, not a memory requester.

As you wrote, using both io-controller and memory controller are
required to prevent swap-out caused by memory consumption on another
cgroup.

> > > > - If there are multiple buffered writers in the system, then those writers
> > > >  can also be forced to writeout some pages to disk before they are
> > > >  allowed to dirty more pages. As per the page cache design, any writer
> > > >  can pick any inode and start writing out pages. So it can happen a
> >  weight group task is writing out pages dirtied by a lower weight group
> > > >  task. If, async bio is mapped to owner's group, it might happen that
> > > >  higher weight group task might be made to sleep on lower weight group
> > > >  task because request descriptors are all consumed up.
> > 
> > As mentioned above, in dm-ioband, the bio is charged to the page owner
> > and issued immediately.
> 
> But you are doing it only for selected pages and not for all buffered
> writes?

I'm sorry, I wrote it incorrectly in the previous mail: IO for writing out
page-cache pages is not issued immediately; it is throttled by
dm-ioband.

Anyway, there is a case where a higher weight group task is made
to sleep, but if we reserve the memory for each cgroup by memory
controller in advance, we can avoid putting the task to sleep.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]   ` <4A9DC33E.6000408-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-09-02 13:45     ` Vivek Goyal
  2009-09-07  2:14     ` Gui Jianfeng
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:45 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 02, 2009 at 08:58:38AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > 
> > For ease of patching, a consolidated patch is available here.
> > 
> > http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> > 
> > Changes from V8
> > ===============
> > - Implemented bdi like congestion semantics for io group also. Now once an
> >   io group gets congested, we don't clear the congestion flag until number
> >   of requests goes below nr_congestion_off.
> > 
> >   This helps in getting rid of Buffered write performance regression we
> >   were observing with io controller patches.
> > 
> >   Gui, can you please test it and see if this version is better in terms
> >   of your buffered write tests.
> 
> Hi Vivek,
> 
> Here are some performance numbers generated by a fio test. It seems V9's performance is better
> than V8's, especially for the write case.
> 
>                          Normal Read  |  Random Read  |  Normal Write  |  Random Write
> 
> V8(Avg of 3 timers)      64667 KiB/s     3387 KiB/s      59197 KiB/s      9327 KiB/s 
> 
> V9(Avg of 3 timers)      65947 KiB/s     3528 KiB/s      61654 KiB/s      9744 KiB/s
> 
> Performance              +2.0%           +4.1%           +4.2%            +4.5%

Thanks Gui. I was also keen to know how the vanilla kernel vs V9
comparison looks. Can you please run the same tests with the vanilla
kernel also?

Thanks
Vivek
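
For reference, the per-group congestion semantics being exercised by these
runs boil down to a hysteresis between two thresholds; a freestanding sketch
(the names are illustrative, not the patch's actual symbols):

#include <stdbool.h>

struct iog_congestion {
	unsigned int nr_requests;
	unsigned int nr_congestion_on;  /* set the flag at/above this */
	unsigned int nr_congestion_off; /* clear it only below this */
	bool congested;
};

static void iog_request_added(struct iog_congestion *g)
{
	if (++g->nr_requests >= g->nr_congestion_on)
		g->congested = true;
}

static void iog_request_completed(struct iog_congestion *g)
{
	/* Hysteresis: the flag stays set until we drop below the lower mark. */
	if (--g->nr_requests < g->nr_congestion_off)
		g->congested = false;
}

The gap between the on and off thresholds keeps the congested bit from
flapping on every request completion.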

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-02  0:58   ` Gui Jianfeng
@ 2009-09-02 13:45     ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:45 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Wed, Sep 02, 2009 at 08:58:38AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > 
> > For ease of patching, a consolidated patch is available here.
> > 
> > http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> > 
> > Changes from V8
> > ===============
> > - Implemented bdi like congestion semantics for io group also. Now once an
> >   io group gets congested, we don't clear the congestion flag until number
> >   of requests goes below nr_congestion_off.
> > 
> >   This helps in getting rid of Buffered write performance regression we
> >   were observing with io controller patches.
> > 
> >   Gui, can you please test it and see if this version is better in terms
> >   of your buffered write tests.
> 
> Hi Vivek,
> 
> Here are some performance numbers generated by a fio test. It seems V9's performance is better
> than V8's, especially for the write case.
> 
>                          Normal Read  |  Random Read  |  Normal Write  |  Random Write
> 
> V8(Avg of 3 timers)      64667 KiB/s     3387 KiB/s      59197 KiB/s      9327 KiB/s 
> 
> V9(Avg of 3 timers)      65947 KiB/s     3528 KiB/s      61654 KiB/s      9744 KiB/s
> 
> Performance              +2.0%           +4.1%           +4.2%            +4.5%

Thanks Gui. I was also keen to know how the vanilla kernel vs V9
comparison looks. Can you please run the same tests with the vanilla
kernel also?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-02 13:45     ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:45 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Wed, Sep 02, 2009 at 08:58:38AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > 
> > For ease of patching, a consolidated patch is available here.
> > 
> > http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> > 
> > Changes from V8
> > ===============
> > - Implemented bdi like congestion semantics for io group also. Now once an
> >   io group gets congested, we don't clear the congestion flag until number
> >   of requests goes below nr_congestion_off.
> > 
> >   This helps in getting rid of Buffered write performance regression we
> >   were observing with io controller patches.
> > 
> >   Gui, can you please test it and see if this version is better in terms
> >   of your buffered write tests.
> 
> Hi Vivek,
> 
> Here are some performance numbers generated by a fio test. It seems V9's performance is better
> than V8's, especially for the write case.
> 
>                          Normal Read  |  Random Read  |  Normal Write  |  Random Write
> 
> V8(Avg of 3 timers)      64667 KiB/s     3387 KiB/s      59197 KiB/s      9327 KiB/s 
> 
> V9(Avg of 3 timers)      65947 KiB/s     3528 KiB/s      61654 KiB/s      9744 KiB/s
> 
> Performance              +2.0%           +4.1%           +4.2%            +4.5%

Thanks Gui. I was also keen to know how the vanilla kernel vs V9
comparison looks. Can you please run the same tests with the vanilla
kernel also?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]               ` <20090902.185251.193693849.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-02 13:58                 ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:58 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 02, 2009 at 06:52:51PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> > > > > - The primary use case of tracking async context seems to be that if a
> > > > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > > > >  group G2, asks for memory and triggers reclaim and generates writes of
> > > > >  the file pages mapped by T1, then these writes should not be charged to
> > > > >  T2, hence blkio_cgroup pages.
> > > > >
> > > > >  But the flip side of this might be that group G2 is a low weight group
> > > > >  and probably too busy also right now, which will delay the write out
> > > > >  and possibly T2 will wait longer for memory to be allocated.
> > > 
> > > In order to avoid this wait, dm-ioband issues IO which has a page with
> > > PG_Reclaim as early as possible.
> > > 
> > 
> > So in the above case IO is still charged to G2, but you keep track of whether the page is
> > PG_Reclaim and then release this bio before other bios queued up in the
> > group?
> 
> Yes, the bio with PG_Reclaim page is given priority over the other bios.
> 
> > > > > - At one point of time Andrew mentioned that buffered writes are generally a
> > > > >  big problem and one needs to map these to owner's group. Though I am not
> > > > >  very sure what specific problem he was referring to. Can we attribute
> > > > >  buffered writes to pdflush threads and move all pdflush threads in a
> > > > >  cgroup to limit system wide write out activity?
> > > 
> > > I think that buffered writes also should be controlled per cgroup as
> > > well as synchronous writes.
> > > 
> > 
> > But it is hard to achieve fairness for buffered writes because we don't
> > create complete parallel IO paths and not necessarily higher weight
> > process dispatches more buffered writes to IO scheduler. (Due to page
> > cache buffered write logic).
> > 
> > So in some cases we might see buffered write fairness and in other cases
> > not. For example, run two dd processes in two groups doing buffered writes
> > and it is hard to achieve fairness between these.
> > 
> > That's why the idea that if we can't ensure Buffered write vs Buffered
> > write fairness in all the cases, then does it make sense to attribute
> > buffered writes to pdflush and put pdflush threads into a separate group
> > to limit system wide write out activity. 
> 
> If all buffered writes are treated as system wide activities, it does
> not mean that bandwidth is being controlled. It is true that pdflush
> doesn't do I/O according to weight, but bandwidth (including for
> buffered writes) should be reserved for each cgroup.
> 
> > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > >  charge those processes for that swap writeout. These processes never
> > > > >  requested swap IO.
> > > 
> > > I think that swap writeouts should be charged to the memory hogging
> > > process, because the process consumes more resources and it should get
> > > a penalty.
> > > 
> > 
> > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > mechanism and kernel's way of providing extended RAM. If we want to solve
> > the issue of memory hogging by a process, then the right way to solve it is to use
> > the memory controller and not to charge the process for IO activity.
> > Instead, probably a more suitable way is to charge swap activity to the root
> > group (where by default all the kernel related activity goes).   
> 
> No. In the current blkio-cgroup, a process which uses a large amount
> of memory gets penalty, not a memory requester.
> 

At ioband level you just get to see bio and page. How do you decide whether
this bio is being issued by a process which is a memory hog?

In fact, the requester of memory could be anybody. It could be the memory hog or a
different process. So are you saying that you have a mechanism where you
can detect that a process is a memory hog and charge swap activity to it?
IOW, if there are two processes A and B, and assume A is the memory hog and
then B requests memory which triggers a lot of swap IO, then you can
charge all that IO to the memory hog A?

Can you please point me to the relevant code in dm-ioband?

IMHO, to keep things simple, all swapping activity should be charged to
root group and be considered as kernel activity and user space not be
charged for that.

Thanks
Vivek

> As you wrote, using both io-controller and memory controller are
> required to prevent swap-out caused by memory consumption on another
> cgroup.
> 
> > > > > - If there are multiple buffered writers in the system, then those writers
> > > > >  can also be forced to writeout some pages to disk before they are
> > > > >  allowed to dirty more pages. As per the page cache design, any writer
> > > > >  can pick any inode and start writing out pages. So it can happen a
> > >  weight group task is writing out pages dirtied by a lower weight group
> > > > >  task. If, async bio is mapped to owner's group, it might happen that
> > > > >  higher weight group task might be made to sleep on lower weight group
> > > > >  task because request descriptors are all consumed up.
> > > 
> > > As mentioned above, in dm-ioband, the bio is charged to the page owner
> > > and issued immediately.
> > 
> > But you are doing it only for selected pages and not for all buffered
> > writes?
> 
> I'm sorry, I wrote it incorrectly in the previous mail: IO for writing out
> page-cache pages is not issued immediately; it is throttled by
> dm-ioband.
> 
> Anyway, there is a case where a higher weight group task is made
> to sleep, but if we reserve the memory for each cgroup by memory
> controller in advance, we can avoid putting the task to sleep.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-02  9:52             ` Ryo Tsuruta
@ 2009-09-02 13:58                 ` Vivek Goyal
  2009-09-02 13:58                 ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:58 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

On Wed, Sep 02, 2009 at 06:52:51PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> > > > > - The primary use case of tracking async context seems to be that if a
> > > > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > > > >  group G2, asks for memory and triggers reclaim and generates writes of
> > > > >  the file pages mapped by T1, then these writes should not be charged to
> > > > >  T2, hence blkio_cgroup pages.
> > > > >
> > > > >  But the flip side of this might be that group G2 is a low weight group
> > > > >  and probably too busy also right now, which will delay the write out
> > > > >  and possibly T2 will wait longer for memory to be allocated.
> > > 
> > > In order to avoid this wait, dm-ioband issues IO which has a page with
> > > PG_Reclaim as early as possible.
> > > 
> > 
> > So in the above case IO is still charged to G2, but you keep track of whether the page is
> > PG_Reclaim and then release this bio before other bios queued up in the
> > group?
> 
> Yes, the bio with PG_Reclaim page is given priority over the other bios.
> 
> > > > > - At one point of time Andrew mentioned that buffered writes are generally a
> > > > >  big problem and one needs to map these to owner's group. Though I am not
> > > > >  very sure what specific problem he was referring to. Can we attribute
> > > > >  buffered writes to pdflush threads and move all pdflush threads in a
> > > > >  cgroup to limit system wide write out activity?
> > > 
> > > I think that buffered writes also should be controlled per cgroup as
> > > well as synchronous writes.
> > > 
> > 
> > But it is hard to achieve fairness for buffered writes because we don't
> > create complete parallel IO paths and not necessarily higher weight
> > process dispatches more buffered writes to IO scheduler. (Due to page
> > cache buffered write logic).
> > 
> > So in some cases we might see buffered write fairness and in other cases
> > not. For example, run two dd processes in two groups doing buffered writes
> > and it is hard to achieve fairness between these.
> > 
> > That's why the idea that if we can't ensure Buffered write vs Buffered
> > write fairness in all the cases, then does it make sense to attribute
> > buffered writes to pdflush and put pdflush threads into a separate group
> > to limit system wide write out activity. 
> 
> If all buffered writes are treated as system wide activities, it does
> not mean that bandwidth is being controlled. It is true that pdflush
> doesn't do I/O according to weight, but bandwidth (including for
> buffered writes) should be reserved for each cgroup.
> 
> > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > >  charge those processes for that swap writeout. These processes never
> > > > >  requested swap IO.
> > > 
> > > I think that swap writeouts should be charged to the memory hogging
> > > process, because the process consumes more resources and it should get
> > > a penalty.
> > > 
> > 
> > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > mechanism and kernel's way of providing extended RAM. If we want to solve
> > the issue of memory hogging by a process, then the right way to solve it is to use
> > the memory controller and not to charge the process for IO activity.
> > Instead, probably a more suitable way is to charge swap activity to the root
> > group (where by default all the kernel related activity goes).   
> 
> No. In the current blkio-cgroup, a process which uses a large amount
> of memory gets penalty, not a memory requester.
> 

At ioband level you just get to see bio and page. How do you decide whether
this bio is being issued by a process which is a memory hog?

In fact, the requester of memory could be anybody. It could be the memory hog or a
different process. So are you saying that you have a mechanism where you
can detect that a process is a memory hog and charge swap activity to it?
IOW, if there are two processes A and B, and assume A is the memory hog and
then B requests memory which triggers a lot of swap IO, then you can
charge all that IO to the memory hog A?

Can you please point me to the relevant code in dm-ioband?

IMHO, to keep things simple, all swapping activity should be charged to
root group and be considered as kernel activity and user space not be
charged for that.

Thanks
Vivek

> As you wrote, using both io-controller and memory controller are
> required to prevent swap-out caused by memory consumption on another
> cgroup.
> 
> > > > > - If there are multiple buffered writers in the system, then those writers
> > > > >  can also be forced to writeout some pages to disk before they are
> > > > >  allowed to dirty more pages. As per the page cache design, any writer
> > > > >  can pick any inode and start writing out pages. So it can happen a
> > >  weight group task is writing out pages dirtied by a lower weight group
> > > > >  task. If, async bio is mapped to owner's group, it might happen that
> > > > >  higher weight group task might be made to sleep on lower weight group
> > > > >  task because request descriptors are all consumed up.
> > > 
> > > As mentioned above, in dm-ioband, the bio is charged to the page owner
> > > and issued immediately.
> > 
> > But you are doing it only for selected pages and not for all buffered
> > writes?
> 
> I'm sorry, I wrote it incorrectly in the previous mail: IO for writing out
> page-cache pages is not issued immediately; it is throttled by
> dm-ioband.
> 
> Anyway, there is a case where a higher weight group task is made
> to sleep, but if we reserve the memory for each cgroup by memory
> controller in advance, we can avoid putting the task to sleep.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-09-02 13:58                 ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-02 13:58 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, riel, lizf, fchecconi, s-uchida,
	containers, linux-kernel, akpm, righi.andrea, torvalds

On Wed, Sep 02, 2009 at 06:52:51PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> > > > > - The primary use case of tracking async context seems to be that if a
> > > > >  process T1 in group G1 mmaps a big file and then another process T2 in
> > > > >  group G2, asks for memory and triggers reclaim and generates writes of
> > > > >  the file pages mapped by T1, then these writes should not be charged to
> > > > >  T2, hence blkio_cgroup pages.
> > > > >
> > > > >  But the flip side of this might be that group G2 is a low weight group
> > > > >  and probably too busy also right now, which will delay the write out
> > > > >  and possibly T2 will wait longer for memory to be allocated.
> > > 
> > > In order to avoid this wait, dm-ioband issues IO which has a page with
> > > PG_Reclaim as early as possible.
> > > 
> > 
> > So in the above case IO is still charged to G2, but you keep track of whether the page is
> > PG_Reclaim and then release this bio before other bios queued up in the
> > group?
> 
> Yes, the bio with PG_Reclaim page is given priority over the other bios.
> 
> > > > > - At one point of time Andrew mentioned that buffered writes are generally a
> > > > >  big problem and one needs to map these to owner's group. Though I am not
> > > > >  very sure what specific problem he was referring to. Can we attribute
> > > > >  buffered writes to pdflush threads and move all pdflush threads in a
> > > > >  cgroup to limit system wide write out activity?
> > > 
> > > I think that buffered writes also should be controlled per cgroup as
> > > well as synchronous writes.
> > > 
> > 
> > But it is hard to achieve fairness for buffered writes because we don't
> > create complete parallel IO paths and not necessarily higher weight
> > process dispatches more buffered writes to IO scheduler. (Due to page
> > cache buffered write logic).
> > 
> > So in some cases we might see buffered write fairness and in other cases
> > not. For example, run two dd processes in two groups doing buffered writes
> > and it is hard to achieve fairness between these.
> > 
> > That's why the idea that if we can't ensure Buffered write vs Buffered
> > write fairness in all the cases, then does it make sense to attribute
> > buffered writes to pdflush and put pdflush threads into a separate group
> > to limit system wide write out activity. 
> 
> If all buffered writes are treated as system wide activities, it does
> not mean that bandwidth is being controlled. It is true that pdflush
> doesn't do I/O according to weight, but bandwidth (including for
> buffered writes) should be reserved for each cgroup.
> 
> > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > >  charge those processes for that swap writeout. These processes never
> > > > >  requested swap IO.
> > > 
> > > I think that swap writeouts should be charged to the memory hogging
> > > process, because the process consumes more resources and it should get
> > > a penalty.
> > > 
> > 
> > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > mechanism and kernel's way of providing extended RAM. If we want to solve
> > the issue of memory hogging by a process, then the right way to solve it is to use
> > the memory controller and not to charge the process for IO activity.
> > Instead, probably a more suitable way is to charge swap activity to the root
> > group (where by default all the kernel related activity goes).   
> 
> No. In the current blkio-cgroup, a process which uses a large amount
> of memory gets penalty, not a memory requester.
> 

At ioband level you just get to see bio and page. How do you decide whether
this bio is being issued by a process which is a memory hog?

In fact, the requester of memory could be anybody. It could be the memory hog or a
different process. So are you saying that you have a mechanism where you
can detect that a process is a memory hog and charge swap activity to it?
IOW, if there are two processes A and B, and assume A is the memory hog and
then B requests memory which triggers a lot of swap IO, then you can
charge all that IO to the memory hog A?

Can you please point me to the relevant code in dm-ioband?

IMHO, to keep things simple, all swapping activity should be charged to
root group and be considered as kernel activity and user space not be
charged for that.

Thanks
Vivek

> As you wrote, using both io-controller and memory controller are
> required to prevent swap-out caused by memory consumption on another
> cgroup.
> 
> > > > > - If there are multiple buffered writers in the system, then those writers
> > > > >  can also be forced to writeout some pages to disk before they are
> > > > >  allowed to dirty more pages. As per the page cache design, any writer
> > > > >  can pick any inode and start writing out pages. So it can happen a
> > >  weight group task is writing out pages dirtied by a lower weight group
> > > > >  task. If, async bio is mapped to owner's group, it might happen that
> > > > >  higher weight group task might be made to sleep on lower weight group
> > > > >  task because request descriptors are all consumed up.
> > > 
> > > As mentioned above, in dm-ioband, the bio is charged to the page owner
> > > and issued immediately.
> > 
> > But you are doing it only for selected pages and not for all buffered
> > writes?
> 
> I'm sorry, I wrote it incorrectly in the previous mail: IO for writing out
> page-cache pages is not issued immediately; it is throttled by
> dm-ioband.
> 
> Anyway, there is a case where a higher weight group task is made
> to sleep, but if we reserve the memory for each cgroup by memory
> controller in advance, we can avoid putting the task to sleep.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]                 ` <20090902135821.GB5012-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-03  2:24                   ` Ryo Tsuruta
  0 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-03  2:24 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > >  charge those processes for that swap writeout. These processes never
> > > > > >  requested swap IO.
> > > > 
> > > > I think that swap writeouts should be charged to the memory hogging
> > > > process, because the process consumes more resources and it should get
> > > > a penalty.
> > > > 
> > > 
> > > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > > mechanism and kernel's way of providing extended RAM. If we want to solve
> > > the issue of memory hogging by a process, then the right way to solve it is to use
> > > the memory controller and not to charge the process for IO activity.
> > > Instead, probably a more suitable way is to charge swap activity to the root
> > > group (where by default all the kernel related activity goes).   
> > 
> > No. In the current blkio-cgroup, a process which uses a large amount
> > of memory gets penalty, not a memory requester.
> > 
> 
> At ioband level you just get to see bio and page. How do you decide whether
> this bio is being issued by a process which is a memory hog?
> 
> In fact, the requester of memory could be anybody. It could be the memory hog or a
> different process. So are you saying that you have a mechanism where you
> can detect that a process is a memory hog and charge swap activity to it?
> IOW, if there are two processes A and B, and assume A is the memory hog and
> then B requests memory which triggers a lot of swap IO, then you can
> charge all that IO to the memory hog A?

When an anonymous page is allocated, blkio-cgroup sets an ID to the
page. And then when the page is going to swap out, dm-ioband can know
who the owner of the page is by retrieving ID from the page.

In the above case, since the pages of the process A are swapped-out, 
dm-ioband charges swap IO to the process A.
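
A sketch of that ownership tracking; the struct and helpers are illustrative
stand-ins for blkio-cgroup's ID machinery, not its real API:

/* Stamp the owning cgroup's ID on the page when it is allocated, and read
 * it back when the page is written to swap, so the IO is billed to the
 * owner rather than to whichever task happened to trigger reclaim. */
struct page_info {
	unsigned long blkio_id;         /* owner's cgroup ID, set at allocation */
};

static void stamp_page_owner(struct page_info *page, unsigned long cgroup_id)
{
	page->blkio_id = cgroup_id;     /* called for new anonymous pages */
}

static unsigned long swap_io_charge_target(const struct page_info *page)
{
	return page->blkio_id;          /* consulted at swap write-out time */
}

The billing decision is therefore based on state recorded at allocation time,
not on the task requesting memory at reclaim time.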

> Can you please point me to the relevant code in dm-ioband?
> 
> IMHO, to keep things simple, all swapping activity should be charged to
> root group and be considered as kernel activity and user space not be
> charged for that.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-02 13:58                 ` Vivek Goyal
  (?)
@ 2009-09-03  2:24                 ` Ryo Tsuruta
  2009-09-03  2:40                     ` Vivek Goyal
       [not found]                   ` <20090903.112423.226782505.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  -1 siblings, 2 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-03  2:24 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > >  charge those processes for that swap writeout. These processes never
> > > > > >  requested swap IO.
> > > > 
> > > > I think that swap writeouts should be charged to the memory hogging
> > > > process, because the process consumes more resources and it should get
> > > > a penalty.
> > > > 
> > > 
> > > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > > mechanism and kernel's way of providing extended RAM. If we want to solve
> > > the issue of memory hogging by a process, then the right way to solve it is to use
> > > the memory controller and not to charge the process for IO activity.
> > > Instead, probably a more suitable way is to charge swap activity to the root
> > > group (where by default all the kernel related activity goes).   
> > 
> > No. In the current blkio-cgroup, a process which uses a large amount
> > of memory gets penalty, not a memory requester.
> > 
> 
> At ioband level you just get to see bio and page. How do you decide whether
> this bio is being issued by a process which is a memory hog?
> 
> In fact, the requester of memory could be anybody. It could be the memory hog or a
> different process. So are you saying that you have a mechanism where you
> can detect that a process is a memory hog and charge swap activity to it?
> IOW, if there are two processes A and B, and assume A is the memory hog and
> then B requests memory which triggers a lot of swap IO, then you can
> charge all that IO to the memory hog A?

When an anonymous page is allocated, blkio-cgroup sets an ID to the
page. And then when the page is going to swap out, dm-ioband can know
who the owner of the page is by retrieving ID from the page.

In the above case, since the pages of the process A are swapped-out, 
dm-ioband charges swap IO to the process A.

> Can you please point me to the relevant code in dm-ioband?
> 
> IMHO, to keep things simple, all swapping activity should be charged to
> root group and be considered as kernel activity and user space not be
> charged for that.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
       [not found]                   ` <20090903.112423.226782505.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-03  2:40                     ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-03  2:40 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 03, 2009 at 11:24:23AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > > >  charge those processes for that swap writeout. These processes never
> > > > > > >  requested swap IO.
> > > > > 
> > > > > I think that swap writeouts should be charged to the memory hogging
> > > > > process, because the process consumes more resources and it should get
> > > > > a penalty.
> > > > > 
> > > > 
> > > > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > > > mechanism and kernel's way of providing extended RAM. If we want to solve
> > > > the issue of memory hogging by a process, then the right way to solve it is to use
> > > > the memory controller and not to charge the process for IO activity.
> > > > Instead, probably a more suitable way is to charge swap activity to the root
> > > > group (where by default all the kernel related activity goes).   
> > > 
> > > No. In the current blkio-cgroup, a process which uses a large amount
> > > of memory gets penalty, not a memory requester.
> > > 
> > 
> > At ioband level you just get to see bio and page. How do you decide whether
> > this bio is being issued by a process which is a memory hog?
> > 
> > In fact, the requester of memory could be anybody. It could be the memory hog or a
> > different process. So are you saying that you have a mechanism where you
> > can detect that a process is a memory hog and charge swap activity to it?
> > IOW, if there are two processes A and B, and assume A is the memory hog and
> > then B requests memory which triggers a lot of swap IO, then you can
> > charge all that IO to the memory hog A?
> 
> When an anonymous page is allocated, blkio-cgroup sets an ID to the
> page. And then when the page is going to swap out, dm-ioband can know
> who the owner of the page is by retrieving ID from the page.
> 
> In the above case, since the pages of the process A are swapped-out, 
> dm-ioband charges swap IO to the process A.
> 

But this does not mean that in all cases the memory hog is being charged for
swap IO, as you have said. So if a process A has done some anonymous page
allocations and later a memory hog B comes in and forces A's pages to be swapped out,
you will charge A for the swap activity, which does not seem fair as B is
the memory hog here?

Thanks
Vivek

> > Can you please point me to the relevant code in dm-ioband?
> > 
> > IMHO, to keep things simple, all swapping activity should be charged to
> > root group and be considered as kernel activity and user space not be
> > charged for that.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-03  2:24                 ` Ryo Tsuruta
@ 2009-09-03  2:40                     ` Vivek Goyal
       [not found]                   ` <20090903.112423.226782505.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-03  2:40 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

On Thu, Sep 03, 2009 at 11:24:23AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > > >  charge those processes for that swap writeout. These processes never
> > > > > > >  requested swap IO.
> > > > > 
> > > > > I think that swap writeouts should be charged to the memory hogging
> > > > > process, because the process consumes more resources and it should get
> > > > > a penalty.
> > > > > 
> > > > 
> > > > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > > > mechanism and kernel's way of providing extended RAM. If we want to solve
> > > > the issue of memory hogging by a process then right way to solve is to use
> > > > memory controller and not by charging the process for IO activity.
> > > > Instead, proabably a more suitable way is to charge swap activity to root
> > > > group (where by default all the kernel related activity goes).   
> > > 
> > > No. In the current blkio-cgroup, a process which uses a large amount
> > > of memory gets penalty, not a memory requester.
> > > 
> > 
> > At ioband level you just get to see bio and page. How do you decide wheter
> > this bio is being issued by a process which is a memory hog?
> > 
> > In fact requester of memory could be anybody. It could be memory hog or a
> > different process. So are you saying that you got a mechanism where you 
> > can detect that a process is memory hog and charge swap activity to it.
> > IOW, if there are two processes A and B and assume A is the memory hog and
> > then B requests for memory which triggers lot of swap IO, then you can
> > charge all that IO to memory hog A?
> 
> When an annoymou page is allocated, blkio-cgroup sets an ID to the
> page. And then when the page is going to swap out, dm-ioband can know
> who the owner of the page is by retrieving ID from the page.
> 
> In the above case, since the pages of the process A are swapped-out, 
> dm-ioband charges swap IO to the process A.
> 

But this does not mean that in all cases memory hog is being charged for
swap IO, as you have said. So if a process A has done some anonymous page
allocations and later a memory hog B comes in and forces swap out of A, 
you will charge A for swap activity which does not seem fair as B is
memory hog here?

Thanks
Vivek

> > Can you please point me to the relevant code in dm-ioband?
> > 
> > IMHO, to keep things simple, all swapping activity should be charged to
> > root group and be considered as kernel activity and user space not be
> > charged for that.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
@ 2009-09-03  2:40                     ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-03  2:40 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, riel, lizf, fchecconi, s-uchida,
	containers, linux-kernel, akpm, righi.andrea, torvalds

On Thu, Sep 03, 2009 at 11:24:23AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > > - Somebody also gave an example where there is a memory hogging process and
> > > > > > >  possibly pushes out some processes to swap. It does not sound fair to
> > > > > > >  charge those proccess for that swap writeout. These processes never
> > > > > > >  requested swap IO.
> > > > > 
> > > > > I think that swap writeouts should be charged to the memory hogging
> > > > > process, because the process consumes more resources and it should get
> > > > > a penalty.
> > > > > 
> > > > 
> > > > A process requesting memory gets IO penalty? IMHO, swapping is a kernel 
> > > > mechanism and kernel's way of providing extended RAM. If we want to solve
> > > > the issue of memory hogging by a process then right way to solve is to use
> > > > memory controller and not by charging the process for IO activity.
> > > > Instead, proabably a more suitable way is to charge swap activity to root
> > > > group (where by default all the kernel related activity goes).   
> > > 
> > > No. In the current blkio-cgroup, a process which uses a large amount
> > > of memory gets penalty, not a memory requester.
> > > 
> > 
> > At ioband level you just get to see bio and page. How do you decide wheter
> > this bio is being issued by a process which is a memory hog?
> > 
> > In fact requester of memory could be anybody. It could be memory hog or a
> > different process. So are you saying that you got a mechanism where you 
> > can detect that a process is memory hog and charge swap activity to it.
> > IOW, if there are two processes A and B and assume A is the memory hog and
> > then B requests for memory which triggers lot of swap IO, then you can
> > charge all that IO to memory hog A?
> 
> When an annoymou page is allocated, blkio-cgroup sets an ID to the
> page. And then when the page is going to swap out, dm-ioband can know
> who the owner of the page is by retrieving ID from the page.
> 
> In the above case, since the pages of the process A are swapped-out, 
> dm-ioband charges swap IO to the process A.
> 

But this does not mean that in all cases memory hog is being charged for
swap IO, as you have said. So if a process A has done some anonymous page
allocations and later a memory hog B comes in and forces swap out of A, 
you will charge A for swap activity which does not seem fair as B is
memory hog here?

Thanks
Vivek

> > Can you please point me to the relevant code in dm-ioband?
> > 
> > IMHO, to keep things simple, all swapping activity should be charged to
> > root group and be considered as kernel activity and user space not be
> > charged for that.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]   ` <1251495072-7780-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-08-29 23:04     ` Rik van Riel
@ 2009-09-03  3:08     ` Munehiro Ikeda
  1 sibling, 0 replies; 322+ messages in thread
From: Munehiro Ikeda @ 2009-09-03  3:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

Vivek Goyal wrote, on 08/28/2009 05:30 PM:
> +static struct io_group *io_find_alloc_group(struct request_queue *q,
> +			struct cgroup *cgroup, struct elv_fq_data *efqd,
> +			int create)
> +{
> +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> +	struct io_group *iog = NULL;
> +	/* Note: Use efqd as key */
> +	void *key = efqd;
> +
> +	/*
> +	 * Take a refenrece to css object. Don't want to map a bio to
> +	 * a group if it has been marked for deletion
> +	 */
> +
> +	if (!css_tryget(&iocg->css))
> +		return iog;

cgroup_to_io_cgroup() returns NULL if only blkio subsystem
is mounted but io subsystem is not.  It can cause NULL pointer
access.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
 block/elevator-fq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b723c12..6714e73 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1827,7 +1827,7 @@ static struct io_group *io_find_alloc_group(struct request_queue *q,
         * a group if it has been marked for deletion
         */
 
-       if (!css_tryget(&iocg->css))
+       if (!iocg || !css_tryget(&iocg->css))
                return iog;
 
        iog = io_cgroup_lookup_group(iocg, key);
-- 
1.6.2.5


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-08-28 21:30   ` Vivek Goyal
@ 2009-09-03  3:08     ` Munehiro Ikeda
  -1 siblings, 0 replies; 322+ messages in thread
From: Munehiro Ikeda @ 2009-09-03  3:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

Hi,

Vivek Goyal wrote, on 08/28/2009 05:30 PM:
> +static struct io_group *io_find_alloc_group(struct request_queue *q,
> +			struct cgroup *cgroup, struct elv_fq_data *efqd,
> +			int create)
> +{
> +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> +	struct io_group *iog = NULL;
> +	/* Note: Use efqd as key */
> +	void *key = efqd;
> +
> +	/*
> +	 * Take a refenrece to css object. Don't want to map a bio to
> +	 * a group if it has been marked for deletion
> +	 */
> +
> +	if (!css_tryget(&iocg->css))
> +		return iog;

cgroup_to_io_cgroup() returns NULL if only the blkio subsystem
is mounted but the io subsystem is not.  That can cause a NULL
pointer dereference.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/elevator-fq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b723c12..6714e73 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1827,7 +1827,7 @@ static struct io_group *io_find_alloc_group(struct request_queue *q,
         * a group if it has been marked for deletion
         */
 
-       if (!css_tryget(&iocg->css))
+       if (!iocg || !css_tryget(&iocg->css))
                return iog;
 
        iog = io_cgroup_lookup_group(iocg, key);
-- 
1.6.2.5


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com
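
To illustrate why the lookup above can return NULL, here is a small
standalone model of the per-hierarchy subsystem state array (the names
are invented for the example and do not match the kernel sources):

#include <stdio.h>

/* two subsystems that can be attached independently to a cgroup hierarchy */
enum { BLKIO_SUBSYS, IO_SUBSYS, NR_SUBSYS };

struct css {
        int refcnt;
};

struct cgroup {
        struct css *subsys[NR_SUBSYS];  /* NULL when a subsystem is absent */
};

/* model of the lookup: NULL when "io" is not part of the mounted hierarchy */
static struct css *lookup_io_css(struct cgroup *cgrp)
{
        return cgrp->subsys[IO_SUBSYS];
}

int main(void)
{
        struct css blkio_state = { .refcnt = 1 };
        /* hierarchy mounted with only the blkio subsystem attached */
        struct cgroup cgrp = { .subsys = { [BLKIO_SUBSYS] = &blkio_state } };

        struct css *iocg = lookup_io_css(&cgrp);

        if (!iocg)              /* the guard the patch above adds */
                printf("io subsystem not mounted for this cgroup, bail out\n");
        else
                printf("io css refcnt = %d\n", iocg->refcnt);
        return 0;
}

When the hierarchy is mounted with only blkio, the io slot is never
populated, so an unguarded css_tryget() would dereference a NULL pointer;
the one-line check in the patch above avoids that.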


^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: [PATCH 18/23] io-controller: blkio_cgroup patches from Ryo to track async bios.
  2009-09-03  2:40                     ` Vivek Goyal
@ 2009-09-03  3:41                       ` Ryo Tsuruta
  -1 siblings, 0 replies; 322+ messages in thread
From: Ryo Tsuruta @ 2009-09-03  3:41 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, riel, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> > > At ioband level you just get to see bio and page. How do you decide whether
> > > this bio is being issued by a process which is a memory hog?
> > > 
> > > In fact requester of memory could be anybody. It could be memory hog or a
> > > different process. So are you saying that you got a mechanism where you 
> > > can detect that a process is memory hog and charge swap activity to it.
> > > IOW, if there are two processes A and B and assume A is the memory hog and
> > > then B requests for memory which triggers lot of swap IO, then you can
> > > charge all that IO to memory hog A?
> > 
> > When an anonymous page is allocated, blkio-cgroup sets an ID to the
> > page. And then when the page is going to swap out, dm-ioband can know
> > who the owner of the page is by retrieving ID from the page.
> > 
> > In the above case, since the pages of the process A are swapped-out, 
> > dm-ioband charges swap IO to the process A.
> > 
> 
> But this does not mean that in all cases the memory hog is being charged
> for the swap IO, as you have said. So if a process A has done some
> anonymous page allocations and later a memory hog B comes in and forces
> A's pages to be swapped out, you will charge A for that swap activity,
> which does not seem fair, since B is the memory hog here?

I think this charging policy is not bad, but I can understand why you
think it's not fair. Do you think it would be fair if all the IO were
charged to B?

We should use both the io and memory controllers together, as you wrote.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-02  0:58   ` Gui Jianfeng
@ 2009-09-07  2:14     ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-07  2:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Hi Vivek,

Here is the vanilla kernel and V9 comparison.

                           Normal Read  |  Random Read  |  Normal Write  |  Random Write

vanilla(Avg of 3 times)   67580 KiB/s     3540 KiB/s      61964 KiB/s      9823 KiB/s

V9(Avg of 3 times)        68954 KiB/s     3567 KiB/s      60654 KiB/s      9858 KiB/s

Performance                +2.0%           +0.7%          -2.1%             +0.3%



^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-09-07  7:40   ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-07  7:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Hi Vivek,

I happened to encounter a bug when I was testing IO Controller V9.
When three tasks run concurrently in three groups, that is, one task
in a parent group and the other two tasks in two different child
groups, each reading or writing files on some disk, say disk "hdb",
a task may hang up, and other tasks which access "hdb" will also
hang up.

The bug only happens when using the AS io scheduler.
The following script can reproduce this bug on my box.

===========
#!/bin/sh

mkdir /cgroup
mount -t cgroup -o io,blkio io  /cgroup

echo anticipatory > /sys/block/hdb/queue/scheduler

mkdir /cgroup/test1
echo 100 > /cgroup/test1/io.weight

mkdir /cgroup/test2
echo 400 > /cgroup/test2/io.weight

mkdir /cgroup/test2/test3
echo 400 > /cgroup/test2/test3/io.weight

mkdir /cgroup/test2/test4
echo 400 > /cgroup/test2/test4/io.weight

#./rwio -r -f /hdb2/2000M.3 &
dd if=/hdb2/2000M.3 of=/dev/null &
pid4=$!
echo $pid4 > /cgroup/test2/test3/tasks
echo "pid4: $pid4"

#./rwio -r -f /hdb2/2000M.1 &
dd if=/hdb2/2000M.1 of=/dev/null &
pid1=$!
echo $pid1 > /cgroup/test1/tasks
echo "pid1 $pid1"

#./rwio -r -f /hdb2/2000M.2 &
dd if=/hdb2/2000M.2 of=/dev/null &
pid2=$!
echo $pid2 > /cgroup/test2/test4/tasks
echo "pid2 $pid2"

sleep 20

for ((;1;))
{
        ps -p $pid1 > /dev/null 2>&1
        if [ $? -ne 0 ]; then
                break
        fi

        kill -9 $pid1 > /dev/null 2>&1
}
for ((;1;))
{
        ps -p $pid2 > /dev/null 2>&1
        if [ $? -ne 0 ]; then
                break
        fi

        kill -9 $pid2 > /dev/null 2>&1
}


kill -9 $pid4 > /dev/null 2>&1

rmdir /cgroup/test2/test3
rmdir /cgroup/test2/test4
rmdir /cgroup/test2
rmdir /cgroup/test1

umount /cgroup
rmdir /cgroup


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-07  7:40   ` Gui Jianfeng
@ 2009-09-08 13:53     ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 13:53 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> I happened to encounter a bug when I was testing IO Controller V9.
> When three tasks run concurrently in three groups, that is, one task
> in a parent group and the other two tasks in two different child
> groups, each reading or writing files on some disk, say disk "hdb",
> a task may hang up, and other tasks which access "hdb" will also
> hang up.
> 
> The bug only happens when using the AS io scheduler.
> The following script can reproduce this bug on my box.
> 

Thanks for testing it out, Gui. I will run this test case on my machine,
see if I can reproduce the issue on my box, and try to fix it.

Is your box completely hung, or does the IO scheduler just not seem to be
doing anything? Can you try to switch the io scheduler to something else
(after it appears to be hung), and see if the switch succeeds and the new
scheduler starts working?

Thanks
Vivek

> ===========
> #!/bin/sh
> 
> mkdir /cgroup
> mount -t cgroup -o io,blkio io  /cgroup
> 
> echo anticipatory > /sys/block/hdb/queue/scheduler
> 
> mkdir /cgroup/test1
> echo 100 > /cgroup/test1/io.weight
> 
> mkdir /cgroup/test2
> echo 400 > /cgroup/test2/io.weight
> 
> mkdir /cgroup/test2/test3
> echo 400 > /cgroup/test2/test3/io.weight
> 
> mkdir /cgroup/test2/test4
> echo 400 > /cgroup/test2/test4/io.weight
> 
> #./rwio -r -f /hdb2/2000M.3 &
> dd if=/hdb2/2000M.3 of=/dev/null &
> pid4=$!
> echo $pid4 > /cgroup/test2/test3/tasks
> echo "pid4: $pid4"
> 
> #./rwio -r -f /hdb2/2000M.1 &
> dd if=/hdb2/2000M.1 of=/dev/null &
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> echo "pid1 $pid1"
> 
> #./rwio -r -f /hdb2/2000M.2 &
> dd if=/hdb2/2000M.2 of=/dev/null &
> pid2=$!
> echo $pid2 > /cgroup/test2/test4/tasks
> echo "pid2 $pid2"
> 
> sleep 20
> 
> for ((;1;))
> {
>         ps -p $pid1 > /dev/null 2>&1
>         if [ $? -ne 0 ]; then
>                 break
>         fi
> 
>         kill -9 $pid1 > /dev/null 2>&1
> }
> for ((;1;))
> {
>         ps -p $pid2 > /dev/null 2>&1
>         if [ $? -ne 0 ]; then
>                 break
>         fi
> 
>         kill -9 $pid2 > /dev/null 2>&1
> }
> 
> 
> kill -9 $pid4 > /dev/null 2>&1
> 
> rmdir /cgroup/test2/test3
> rmdir /cgroup/test2/test4
> rmdir /cgroup/test2
> rmdir /cgroup/test1
> 
> umount /cgroup
> rmdir /cgroup

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-07  2:14     ` Gui Jianfeng
@ 2009-09-08 13:55       ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 13:55 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Mon, Sep 07, 2009 at 10:14:06AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> Here is the vanilla kernel and V9 comparison.
> 
>                            Normal Read  |  Random Read  |  Normal Write  |  Random Write
> 
> vanilla(Avg of 3 times)   67580 KiB/s     3540 KiB/s      61964 KiB/s      9823 KiB/s
> 
> V9(Avg of 3 times)        68954 KiB/s     3567 KiB/s      60654 KiB/s      9858 KiB/s
> 
> Performance                +2.0%           +0.7%          -2.1%             +0.3%
> 

Thanks, Gui. That's a lot of improvement over the previous versions. I
remember you were noticing a 7% regression in buffered write performance.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-07  7:40   ` Gui Jianfeng
@ 2009-09-08 19:19     ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 19:19 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> I happened to encounter a bug when I was testing IO Controller V9.
> When three tasks run concurrently in three groups, that is, one task
> in a parent group and the other two tasks in two different child
> groups, each reading or writing files on some disk, say disk "hdb",
> a task may hang up, and other tasks which access "hdb" will also
> hang up.
> 
> The bug only happens when using the AS io scheduler.
> The following script can reproduce this bug on my box.
> 

Hi Gui,

I tried reproducing this on my system and can't reproduce it. All
three processes get killed and the system does not hang.

Can you please dig a bit deeper into it?

- Does the whole system hang, or does only IO to the disk seem to be hung?
- Does switching the io scheduler on the device still work?
- If the system is not hung, can you capture a blktrace on the device?
  The trace might give some idea of what's happening.

Thanks
Vivek
 
> ===========
> #!/bin/sh
> 
> mkdir /cgroup
> mount -t cgroup -o io,blkio io  /cgroup
> 
> echo anticipatory > /sys/block/hdb/queue/scheduler
> 
> mkdir /cgroup/test1
> echo 100 > /cgroup/test1/io.weight
> 
> mkdir /cgroup/test2
> echo 400 > /cgroup/test2/io.weight
> 
> mkdir /cgroup/test2/test3
> echo 400 > /cgroup/test2/test3/io.weight
> 
> mkdir /cgroup/test2/test4
> echo 400 > /cgroup/test2/test4/io.weight
> 
> #./rwio -r -f /hdb2/2000M.3 &
> dd if=/hdb2/2000M.3 of=/dev/null &
> pid4=$!
> echo $pid4 > /cgroup/test2/test3/tasks
> echo "pid4: $pid4"
> 
> #./rwio -r -f /hdb2/2000M.1 &
> dd if=/hdb2/2000M.1 of=/dev/null &
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> echo "pid1 $pid1"
> 
> #./rwio -r -f /hdb2/2000M.2 &
> dd if=/hdb2/2000M.2 of=/dev/null &
> pid2=$!
> echo $pid2 > /cgroup/test2/test4/tasks
> echo "pid2 $pid2"
> 
> sleep 20
> 
> for ((;1;))
> {
>         ps -p $pid1 > /dev/null 2>&1
>         if [ $? -ne 0 ]; then
>                 break
>         fi
> 
>         kill -9 $pid1 > /dev/null 2>&1
> }
> for ((;1;))
> {
>         ps -p $pid2 > /dev/null 2>&1
>         if [ $? -ne 0 ]; then
>                 break
>         fi
> 
>         kill -9 $pid2 > /dev/null 2>&1
> }
> 
> 
> kill -9 $pid4 > /dev/null 2>&1
> 
> rmdir /cgroup/test2/test3
> rmdir /cgroup/test2/test4
> rmdir /cgroup/test2
> rmdir /cgroup/test1
> 
> umount /cgroup
> rmdir /cgroup

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-08-28 21:30 ` Vivek Goyal
@ 2009-09-08 22:28   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 22:28 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel

On Fri, Aug 28, 2009 at 05:30:49PM -0400, Vivek Goyal wrote:
> 
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 

Found a few issues during testing. Sending 3 more patches for this series.
After some more testing, I will fold these patches into the higher-level
patches and post V10.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-08 22:28   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 22:28 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers, akpm,
	righi.andrea, torvalds

On Fri, Aug 28, 2009 at 05:30:49PM -0400, Vivek Goyal wrote:
> 
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> 
> For ease of patching, a consolidated patch is available here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
> 

Found few issues during testing. Sending 3 more patches for this series.
After some more testing, will merge these patches in higher level patches
and post V10.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [PATCH 24/23] io-controller: Don't leave a queue active when a disk is idle
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (26 preceding siblings ...)
  2009-09-08 22:28   ` Vivek Goyal
@ 2009-09-08 22:28   ` Vivek Goyal
  2009-09-08 22:28   ` [PATCH 25/23] io-controller: fix queue vs group fairness Vivek Goyal
                     ` (2 subsequent siblings)
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 22:28 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


o It is possible that when there is only a single queue in the system, it
  remains unexpired for a long time (because there is no IO activity on the
  disk). When the next request comes in after a long gap, the scheduler then
  thinks the queue was using the disk all that while and assigns it a high
  vdisktime. Hence make sure the queue is expired once all of its requests
  have completed (a simplified illustration of the over-charge is sketched
  below).

o Also avoid unnecessarily expiring a queue when it has one request
  dispatched to the disk, is waiting for it to finish, and has no more
  requests queued to dispatch.
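
For illustration, here is a small stand-alone program (a toy model with
made-up jiffies values, not the actual elevator-fq accounting code) showing
how much extra time would be charged to the lone queue if it were left
active across an idle gap instead of being expired when its last request
completed:

#include <stdio.h>

int main(void)
{
	unsigned long slice_start  = 1000;	/* queue became active here */
	unsigned long last_io_done = 1020;	/* last request completed */
	unsigned long next_request = 5000;	/* disk idle until new IO arrives */

	/* charge if the queue is only expired when the new request shows up */
	unsigned long charge_if_kept_active = next_request - slice_start;  /* 4000 */
	/* charge if the queue is expired as soon as it has nothing to do */
	unsigned long charge_if_expired     = last_io_done - slice_start;  /*   20 */

	printf("kept active: %lu, expired on completion: %lu\n",
	       charge_if_kept_active, charge_if_expired);
	return 0;
}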

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: linux16/block/elevator-fq.c
===================================================================
--- linux16.orig/block/elevator-fq.c	2009-09-08 15:43:36.000000000 -0400
+++ linux16/block/elevator-fq.c	2009-09-08 15:44:21.000000000 -0400
@@ -2947,6 +2947,10 @@ void *elv_select_ioq(struct request_queu
 	if (ioq == NULL)
 		goto new_queue;
 
+	/* There is only one active queue which is empty. Nothing to dispatch */
+	if (elv_nr_busy_ioq(q->elevator) == 1 && !ioq->nr_queued)
+		return NULL;
+
 	iog = ioq_to_io_group(ioq);
 
 	/*
@@ -3236,6 +3240,17 @@ void elv_ioq_completed_request(struct re
 			else
 				elv_iog_arm_slice_timer(q, iog, 0);
 		}
+
+		/*
+		 * if this is the only queue and it has completed all its requests
+		 * and has nothing to dispatch, expire it. We don't want to
+		 * keep it around idle otherwise later when it is expired, all
+		 * this idle time will be added to queue's disk time used.
+		 */
+		if (efqd->busy_queues == 1 && !ioq->dispatched &&
+		   !ioq->nr_queued && !timer_pending(&efqd->idle_slice_timer)) {
+			elv_slice_expired(q);
+		}
 	}
 done:
 	if (!efqd->rq_in_driver)

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [PATCH 25/23] io-controller: fix queue vs group fairness
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (27 preceding siblings ...)
  2009-09-08 22:28   ` [PATCH 24/23] io-controller: Don't leave a queue active when a disk is idle Vivek Goyal
@ 2009-09-08 22:28   ` Vivek Goyal
  2009-09-08 22:28   ` [PATCH 26/23] io-controller: fix writer preemption with in a group Vivek Goyal
  2009-09-10 15:18   ` [RFC] IO scheduler based IO controller V9 Jerome Marchand
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 22:28 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


o I found an issue during testing: if there is a mix of queues and groups at
  the same level, there can be a fairness issue. For example, consider the
  following case.

			root
			/ \
		       T1  G1
			    |
			    T2

 T1 and T2 are two tasks with prio 0 and 7 respectively, and G1 is a group
 with weight 900.

 Task T1 (prio 0) maps to weight 900, so it gets a slice length of 180ms;
 then its queue expires and is put after G1. (Note that for a reader the
 next request most likely arrives after queue expiry, so the queue gets
 deleted, and when the request comes in again the queue is added to the
 tree fresh. A fresh queue is added at the end of the tree, so it ends up
 after G1.)

 Now G1 gets to run (effectively T2 runs). T2 has prio 7, which maps to
 weight 200 and a slice length of 40ms, and it expires after that. G1 now
 gets a new vtime which effectively reflects a charge of 40ms.

 To get fairness G1 should run more (it has received only 40ms against T1's
 180ms at equal weight), but instead T1 will be running because we gave it
 a vtime that is the same as G1's.

 The core issue here is that for readers, when the slice expires the queue
 is empty and not backlogged, so it gets deleted from the tree. Because CFQ
 only operates in flat mode, it did a smart thing and did not keep track of
 history. Instead it hands out slice lengths according to prio; if one
 round of dispatch gives fairness, fine, otherwise upon queue expiry you
 are placed at the end of the service tree.

 This does not work in hierarchical setups, where a group's slice length is
 determined not by the group's weight but by the weight of the queue that
 runs under the group.

 Hence we need to keep track of history and assign a new vtime based on the
 disk time used by the current queue at the time of expiry.

 But here the io scheduler is a little different from CFS in that, at the
 time of expiry, the reader's queue is usually empty. So one ends up
 deleting it from the service tree, the next request arrives within 1ms,
 and the queue enters the tree again like a new process.

 So we need to remember a process io queue's vdisktime even after it has
 been deleted from the io scheduler's service tree, and reuse that
 vdisktime if the queue gets backlogged again. But blindly trusting an
 ioq's vdisktime is bad, because it can lead to issues if a service tree
 min_vtime wrap around takes place between two requests of the queue.
 (Agreed, that is not easy to hit, but it is possible.)

 Hence, keep a cache of io queues serviced recently; when a queue gets
 backlogged, use the cached vdisktime if it is found in the cache,
 otherwise assign a new vdisktime. This cache of io queues (the idle tree)
 is basically the idea implemented by the BFQ folks. I had gotten rid of
 idle trees in V9 and am now bringing them back. (Now I understand them
 better. :-))

 There is one good side effect of keeping the cache of recently serviced
 io queues: CFQ can now differentiate between streaming readers and new
 processes doing IO. For a new queue (one not in the cache) we can assign
 a lower vdisktime, and for a streaming reader we assign a vdisktime based
 on disk time used. This way small file readers, or processes doing small
 amounts of IO, get reduced latencies at the cost of slightly reduced
 throughput for streaming readers. The fairness arithmetic of the example
 above is sketched below.
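
For concreteness, here is a stand-alone toy calculation of the charges in
the example above. The scaling used (service_ms * 1000 / weight) is an
assumption for illustration only, not the actual elevator-fq formula:

#include <stdio.h>

static unsigned long long vdelta(unsigned long long service_ms,
				 unsigned int weight)
{
	/* toy model: vdisktime advances by service time scaled by weight */
	return service_ms * 1000ULL / weight;
}

int main(void)
{
	/* T1: prio 0 -> weight 900, used 180ms.  G1: weight 900, used 40ms. */
	unsigned long long t1 = vdelta(180, 900);	/* 200 */
	unsigned long long g1 = vdelta(40, 900);	/*  44 */

	printf("T1 charge=%llu  G1 charge=%llu\n", t1, g1);
	/*
	 * If T1's queue is deleted on expiry and later re-added as a fresh
	 * queue, its 200 units of history are lost and it competes with G1
	 * as an equal.  Caching its vdisktime on the idle tree preserves
	 * the 200 vs 44 gap, so G1 keeps the disk until it has received a
	 * comparable share.
	 */
	return 0;
}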

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c |    2 
 block/elevator-fq.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++----
 block/elevator-fq.h |    9 +
 3 files changed, 246 insertions(+), 17 deletions(-)

Index: linux16/block/elevator-fq.c
===================================================================
--- linux16.orig/block/elevator-fq.c	2009-09-08 15:44:21.000000000 -0400
+++ linux16/block/elevator-fq.c	2009-09-08 15:47:45.000000000 -0400
@@ -52,6 +52,8 @@ static struct kmem_cache *elv_ioq_pool;
 #define elv_log_entity(entity, fmt, args...)
 #endif
 
+static void check_idle_tree_release(struct io_service_tree *st);
+
 static inline struct io_queue *ioq_of(struct io_entity *entity)
 {
 	if (entity->my_sd == NULL)
@@ -109,6 +111,11 @@ elv_prio_to_slice(struct elv_fq_data *ef
 	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
 }
 
+static inline int vdisktime_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
+
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
 {
 	s64 delta = (s64)(vdisktime - min_vdisktime);
@@ -145,6 +152,7 @@ static void update_min_vdisktime(struct 
 	}
 
 	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+	check_idle_tree_release(st);
 }
 
 static inline struct io_entity *parent_entity(struct io_entity *entity)
@@ -411,27 +419,46 @@ static void place_entity(struct io_servi
 	struct rb_node *parent;
 	struct io_entity *entry;
 	int nr_active = st->nr_active - 1;
+	struct io_queue *ioq = ioq_of(entity);
+	int sync = 1;
+
+	if (ioq)
+		sync = elv_ioq_sync(ioq);
+
+	if (add_front || !nr_active) {
+		vdisktime = st->min_vdisktime;
+		goto done;
+	}
+
+	if (sync && entity->vdisktime
+	    && vdisktime_gt(entity->vdisktime, st->min_vdisktime)) {
+		/* vdisktime still in future. Use old vdisktime */
+		vdisktime = entity->vdisktime;
+		goto done;
+	}
 
 	/*
-	 * Currently put entity at the end of last entity. This probably will
-	 * require adjustments as we move along
+	 * Effectively a new queue. Assign sync queue a lower vdisktime so
+	 * we can achieve better latencies for small file readers. For async
+	 * queues, put them at the end of the existing queue.
+	 * Group entities are always considered sync.
 	 */
-	if (io_entity_class_idle(entity)) {
-		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
-		parent = rb_last(&st->active);
-		if (parent) {
-			entry = rb_entry(parent, struct io_entity, rb_node);
-			vdisktime += entry->vdisktime;
-		}
-	} else if (!add_front && nr_active) {
-		parent = rb_last(&st->active);
-		if (parent) {
-			entry = rb_entry(parent, struct io_entity, rb_node);
-			vdisktime = entry->vdisktime;
-		}
-	} else
+	if (sync) {
 		vdisktime = st->min_vdisktime;
+		goto done;
+	}
 
+	/*
+	 * Put entity at the end of the tree. Effectively async queues use
+	 * this path.
+	 */
+	parent = rb_last(&st->active);
+	if (parent) {
+		entry = rb_entry(parent, struct io_entity, rb_node);
+		vdisktime = entry->vdisktime;
+	} else
+		vdisktime = st->min_vdisktime;
+done:
 	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
 	elv_log_entity(entity, "place_entity: vdisktime=%llu"
 			" min_vdisktime=%llu", entity->vdisktime,
@@ -447,6 +474,122 @@ static inline void io_entity_update_prio
 		 */
 		init_io_entity_service_tree(entity, parent_entity(entity));
 		entity->ioprio_changed = 0;
+
+		/*
+		 * Assign this entity a fresh vdisktime instead of using
+		 * previous one as prio class will lead to service tree
+		 * change and this vdisktime will not be valid on new
+		 * service tree.
+		 *
+		 * TODO: Handle the case of only prio change.
+		 */
+		entity->vdisktime = 0;
+	}
+}
+
+static void
+__dequeue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
+{
+	if (st->rb_leftmost_idle == &entity->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&entity->rb_node);
+		st->rb_leftmost_idle = next_node;
+	}
+
+	rb_erase(&entity->rb_node, &st->idle);
+	RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity_idle(struct io_entity *entity)
+{
+	struct io_queue *ioq = ioq_of(entity);
+
+	__dequeue_io_entity_idle(entity->st, entity);
+	entity->on_idle_st = 0;
+	if (ioq)
+		elv_put_ioq(ioq);
+}
+
+static void
+__enqueue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
+{
+	struct rb_node **node = &st->idle.rb_node;
+	struct rb_node *parent = NULL;
+	struct io_entity *entry;
+	int leftmost = 1;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (vdisktime_gt(entry->vdisktime, entity->vdisktime))
+			node = &parent->rb_left;
+		else {
+			node = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used)
+	 */
+	if (leftmost)
+		st->rb_leftmost_idle = &entity->rb_node;
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, &st->idle);
+}
+
+static void enqueue_io_entity_idle(struct io_entity *entity)
+{
+	struct io_queue *ioq = ioq_of(entity);
+	struct io_group *parent_iog;
+
+	/*
+	 * Don't put an entity on idle tree if it has been marked for deletion.
+	 * We are not expecting more io from this entity. No need to cache it
+	 */
+
+	if (entity->exiting)
+		return;
+
+	/*
+	 * If parent group is exiting, don't put on idle tree. May be task got
+	 * moved to a different cgroup and original cgroup got deleted
+	 */
+	parent_iog = iog_of(parent_entity(entity));
+	if (parent_iog->entity.exiting)
+		return;
+
+	if (ioq)
+		elv_get_ioq(ioq);
+	__enqueue_io_entity_idle(entity->st, entity);
+	entity->on_idle_st = 1;
+}
+
+static void check_idle_tree_release(struct io_service_tree *st)
+{
+	struct io_entity *leftmost;
+
+	if (!st->rb_leftmost_idle)
+		return;
+
+	leftmost = rb_entry(st->rb_leftmost_idle, struct io_entity, rb_node);
+
+	if (vdisktime_gt(st->min_vdisktime, leftmost->vdisktime))
+		dequeue_io_entity_idle(leftmost);
+}
+
+static void flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	while(st->rb_leftmost_idle) {
+		entity = rb_entry(st->rb_leftmost_idle, struct io_entity,
+					rb_node);
+		dequeue_io_entity_idle(entity);
 	}
 }
 
@@ -483,6 +626,9 @@ static void dequeue_io_entity(struct io_
 	st->nr_active--;
 	sd->nr_active--;
 	debug_update_stats_dequeue(entity);
+
+	if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
+		enqueue_io_entity_idle(entity);
 }
 
 static void
@@ -524,6 +670,16 @@ static void enqueue_io_entity(struct io_
 	struct io_service_tree *st;
 	struct io_sched_data *sd = io_entity_sched_data(entity);
 
+	if (entity->on_idle_st)
+		dequeue_io_entity_idle(entity);
+	else
+		/*
+		 * This entity was not in idle tree cache. Zero out vdisktime
+		 * so that we don't rely on old vdisktime instead assign a
+		 * fresh one.
+		 */
+		entity->vdisktime = 0;
+
 	io_entity_update_prio(entity);
 	st = entity->st;
 	st->nr_active++;
@@ -574,6 +730,8 @@ static void requeue_io_entity(struct io_
 	struct io_service_tree *st = entity->st;
 	struct io_entity *next_entity;
 
+	entity->vdisktime = 0;
+
 	if (add_front) {
 		next_entity = __lookup_next_io_entity(st);
 
@@ -1937,11 +2095,18 @@ static void io_free_root_group(struct el
 {
 	struct io_group *iog = e->efqd->root_group;
 	struct io_cgroup *iocg = &io_root_cgroup;
+	struct io_service_tree *st;
+	int i;
 
 	spin_lock_irq(&iocg->lock);
 	hlist_del_rcu(&iog->group_node);
 	spin_unlock_irq(&iocg->lock);
 
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		flush_idle_tree(st);
+	}
+
 	put_io_group_queues(e, iog);
 	elv_put_iog(iog);
 }
@@ -2039,9 +2204,29 @@ EXPORT_SYMBOL(elv_put_iog);
  */
 static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
 {
+	struct io_service_tree *st;
+	int i;
+	struct io_entity *entity = &iog->entity;
+
+	/*
+	 * Mark io group for deletion so that no new entry goes in
+	 * idle tree. Any active queue which is removed from active
+	 * tree will not be put in to idle tree.
+	 */
+	entity->exiting = 1;
+
+	/* We flush idle tree now, and don't put things in there any more. */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		flush_idle_tree(st);
+	}
+
 	hlist_del(&iog->elv_data_node);
 	put_io_group_queues(efqd->eq, iog);
 
+	if (entity->on_idle_st)
+		dequeue_io_entity_idle(entity);
+
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -2374,7 +2559,13 @@ static struct io_group *io_alloc_root_gr
 static void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd->root_group;
+	struct io_service_tree *st;
+	int i;
 
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		flush_idle_tree(st);
+	}
 	put_io_group_queues(e, iog);
 	kfree(iog);
 }
@@ -3257,6 +3448,35 @@ done:
 		elv_schedule_dispatch(q);
 }
 
+/*
+ * The process associated with ioq (in case of cfq) is going away. Mark it
+ * for deletion.
+ */
+void elv_exit_ioq(struct io_queue *ioq)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	/*
+	 * Async ioq's belong to io group and are cleaned up once group is
+	 * being deleted. No need to do any cleanup here even if cfq has
+	 * dropped the reference to the queue
+	 */
+	if (!elv_ioq_sync(ioq))
+		return;
+
+	/*
+ 	 * This queue is still under service. Just mark it so that once all
+	 * the IO from queue is done, it is not put back in idle tree.
+	 */
+	if (entity->on_st) {
+		entity->exiting = 1;
+		return;
+	} else if(entity->on_idle_st) {
+		/* Remove ioq from idle tree */
+		dequeue_io_entity_idle(entity);
+	}
+}
+EXPORT_SYMBOL(elv_exit_ioq);
 static void elv_slab_kill(void)
 {
 	/*
Index: linux16/block/cfq-iosched.c
===================================================================
--- linux16.orig/block/cfq-iosched.c	2009-09-08 15:43:36.000000000 -0400
+++ linux16/block/cfq-iosched.c	2009-09-08 15:47:45.000000000 -0400
@@ -1138,6 +1138,7 @@ static void cfq_exit_cfqq(struct cfq_dat
 		elv_schedule_dispatch(cfqd->queue);
 	}
 
+	elv_exit_ioq(cfqq->ioq);
 	cfq_put_queue(cfqq);
 }
 
@@ -1373,6 +1374,7 @@ static void changed_cgroup(struct io_con
 		 */
 		if (iog != __iog) {
 			cic_set_cfqq(cic, NULL, 1);
+			elv_exit_ioq(sync_cfqq->ioq);
 			cfq_put_queue(sync_cfqq);
 		}
 	}
Index: linux16/block/elevator-fq.h
===================================================================
--- linux16.orig/block/elevator-fq.h	2009-09-08 15:43:36.000000000 -0400
+++ linux16/block/elevator-fq.h	2009-09-08 15:47:45.000000000 -0400
@@ -33,6 +33,10 @@ struct io_service_tree {
 	u64 min_vdisktime;
 	struct rb_node *rb_leftmost;
 	unsigned int nr_active;
+
+        /* A cache of io entities which were served and expired */
+        struct rb_root idle;
+        struct rb_node *rb_leftmost_idle;
 };
 
 struct io_sched_data {
@@ -44,9 +48,12 @@ struct io_sched_data {
 struct io_entity {
 	struct rb_node rb_node;
 	int on_st;
+	int on_idle_st;
 	u64 vdisktime;
 	unsigned int weight;
 	struct io_entity *parent;
+	/* This io entity (queue or group) has been marked for deletion */
+	unsigned int exiting;
 
 	struct io_sched_data *my_sd;
 	struct io_service_tree *st;
@@ -572,7 +579,7 @@ extern struct io_queue *elv_alloc_ioq(st
 extern void elv_free_ioq(struct io_queue *ioq);
 extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
 extern int elv_iog_should_idle(struct io_queue *ioq);
-
+extern void elv_exit_ioq(struct io_queue *ioq);
 #else /* CONFIG_ELV_FAIR_QUEUING */
 static inline struct elv_fq_data *
 elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [PATCH 26/23] io-controller: fix writer preemption with in a group
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (28 preceding siblings ...)
  2009-09-08 22:28   ` [PATCH 25/23] io-controller: fix queue vs group fairness Vivek Goyal
@ 2009-09-08 22:28   ` Vivek Goyal
  2009-09-10 15:18   ` [RFC] IO scheduler based IO controller V9 Jerome Marchand
  30 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-08 22:28 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


o Found another issue during testing. Consider the following hierarchy.

			root
			/ \
		       R1  G1
			  /\
			 R2 W

  Generally in CFQ, when readers and writers are running, a reader
  immediately preempts writers and hence the reader gets the better
  bandwidth. In a hierarchical setup it becomes a little more tricky. In
  the above diagram, G1 is a group, R1 and R2 are reader tasks, and W is a
  writer task.

  Now assume W runs, then R1 runs, then R2 runs. After R2 has used its time
  slice and R1 is scheduled in, R2 (a streaming reader) gets backlogged
  again in group G1 after a couple of ms. But it will not preempt R1,
  because R1 is also a reader and because preemption across groups is not
  allowed for isolation reasons. Hence R2 sits backlogged in G1 with a
  vdisktime much higher than W's. So when G1 gets scheduled again, W gets
  to run its full slice length despite the fact that R2 is queued on the
  same service tree.

  The core issue here is that apart from regular preemptions (preemption
  across classes), CFQ also has this special notion of preemption within a
  class, and that can lead to issues when the active task is running in a
  different group than the one where the new queue gets backlogged.

  To solve the issue, keep track of this event (I am calling it late
  preemption). When a group becomes eligible to run again, if
  late_preemption is set, check whether there are sync readers backlogged,
  and if so, expire the writer after one round of dispatch (a toy model of
  this bookkeeping is sketched below).

  This solves the issue of the reader not getting enough bandwidth in
  hierarchical setups.
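
The following is a minimal, stand-alone sketch of the late-preemption
bookkeeping. The toy_* structures and helpers are hypothetical stand-ins;
the real logic operates on struct io_group / struct io_queue in
elevator-fq.c, as shown in the diff below:

#include <stdio.h>
#include <stdbool.h>

struct toy_group { int nr_sync; bool late_preemption; };
struct toy_queue { bool sync; struct toy_group *grp; bool must_expire; };

/* A sync queue got backlogged while a queue from another group was active. */
static void toy_set_late_preemption(struct toy_queue *newq,
				    struct toy_group *active_grp)
{
	if (newq->sync && newq->grp != active_grp)
		newq->grp->late_preemption = true;
}

/* The group runs again and picks a writer: one dispatch round, then expire. */
static void toy_check_late_preemption(struct toy_queue *picked)
{
	if (!picked->sync && picked->grp->late_preemption &&
	    picked->grp->nr_sync > 0)
		picked->must_expire = true;
}

int main(void)
{
	struct toy_group root = { 0, false }, g1 = { 0, false };
	struct toy_queue w  = { false, &g1, false };
	struct toy_queue r2 = { true,  &g1, false };

	g1.nr_sync = 1;				/* R2 is backlogged in G1   */
	toy_set_late_preemption(&r2, &root);	/* R1 (in root) was active  */
	toy_check_late_preemption(&w);		/* G1 runs again, picks W   */

	printf("W must expire after one dispatch round: %d\n", w.must_expire);
	return 0;
}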

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  125 ++++++++++++++++++++++++++++++++++++++++++++++------
 block/elevator-fq.h |    2 
 2 files changed, 114 insertions(+), 13 deletions(-)

Index: linux16/block/elevator-fq.h
===================================================================
--- linux16.orig/block/elevator-fq.h	2009-09-08 15:47:45.000000000 -0400
+++ linux16/block/elevator-fq.h	2009-09-08 16:12:25.000000000 -0400
@@ -43,6 +43,7 @@ struct io_sched_data {
 	struct io_entity *active_entity;
 	int nr_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+	int nr_sync;
 };
 
 struct io_entity {
@@ -150,6 +151,7 @@ struct io_group {
 	unsigned long state;
 	/* request list associated with the group */
 	struct request_list rl;
+	int late_preemption;
 };
 
 struct io_policy_node {
Index: linux16/block/elevator-fq.c
===================================================================
--- linux16.orig/block/elevator-fq.c	2009-09-08 15:47:45.000000000 -0400
+++ linux16/block/elevator-fq.c	2009-09-08 16:12:25.000000000 -0400
@@ -236,6 +236,68 @@ io_entity_sched_data(struct io_entity *e
 	return &iog_of(parent_entity(entity))->sched_data;
 }
 
+static inline void set_late_preemption(struct elevator_queue *eq,
+			struct io_queue *active_ioq, struct io_queue *new_ioq)
+{
+	struct io_group *new_iog;
+
+	if (elv_iosched_single_ioq(eq))
+		return;
+
+	if (!active_ioq)
+		return;
+
+	/* For the time being, set late preempt only if new queue is sync */
+	if (!elv_ioq_sync(new_ioq))
+		return;
+
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (ioq_to_io_group(active_ioq) != new_iog
+	    && !new_iog->late_preemption) {
+		new_iog->late_preemption = 1;
+		elv_log_ioq(eq->efqd, new_ioq, "set late preempt");
+	}
+}
+
+static inline void reset_late_preemption(struct elevator_queue *eq,
+				struct io_group *iog, struct io_queue *ioq)
+{
+	if (iog->late_preemption) {
+		iog->late_preemption = 0;
+		elv_log_ioq(eq->efqd, ioq, "reset late preempt");
+	}
+}
+
+static inline void
+check_late_preemption(struct elevator_queue *eq, struct io_queue *ioq)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+
+	if (elv_iosched_single_ioq(eq))
+		return;
+
+	if (!iog->late_preemption)
+		return;
+
+	/*
+	 * If a sync queue got queued in a group where other writers are
+	 * queued, and at the time of queuing some other reader was running
+	 * in another group, then this reader will not preempt the reader in
+	 * the other group. A side effect of this is that once this group gets
+	 * scheduled, the writer will start running and will not get preempted,
+	 * as it should have been.
+	 *
+	 * Don't expire the writer right now otherwise writers might get
+	 * completely starved. Let it just do one dispatch round and then
+	 * expire. Mark the queue for expiry.
+	 */
+	if (!elv_ioq_sync(ioq) && iog->sched_data.nr_sync) {
+		elv_mark_ioq_must_expire(ioq);
+		elv_log_ioq(eq->efqd, ioq, "late preempt, must expire");
+	}
+}
+
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -267,6 +329,20 @@ io_entity_sched_data(struct io_entity *e
 
 	return &efqd->root_group->sched_data;
 }
+
+static inline void set_late_preemption(struct elevator_queue *eq,
+		struct io_queue *active_ioq, struct io_queue *new_ioq)
+{
+}
+
+static inline void reset_late_preemption(struct elevator_queue *eq,
+				struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline void
+check_late_preemption(struct elevator_queue *eq, struct io_queue *ioq) { }
+
 #endif /* GROUP_IOSCHED */
 
 static inline void
@@ -620,11 +696,14 @@ static void dequeue_io_entity(struct io_
 {
 	struct io_service_tree *st = entity->st;
 	struct io_sched_data *sd = io_entity_sched_data(entity);
+	struct io_queue *ioq = ioq_of(entity);
 
 	__dequeue_io_entity(st, entity);
 	entity->on_st = 0;
 	st->nr_active--;
 	sd->nr_active--;
+	if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
+		sd->nr_sync--;
 	debug_update_stats_dequeue(entity);
 
 	if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
@@ -669,6 +748,7 @@ static void enqueue_io_entity(struct io_
 {
 	struct io_service_tree *st;
 	struct io_sched_data *sd = io_entity_sched_data(entity);
+	struct io_queue *ioq = ioq_of(entity);
 
 	if (entity->on_idle_st)
 		dequeue_io_entity_idle(entity);
@@ -684,6 +764,9 @@ static void enqueue_io_entity(struct io_
 	st = entity->st;
 	st->nr_active++;
 	sd->nr_active++;
+	/* Keep a track of how many sync queues are backlogged on this group */
+	if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
+		sd->nr_sync++;
 	entity->on_st = 1;
 	place_entity(st, entity, 0);
 	__enqueue_io_entity(st, entity, 0);
@@ -2796,6 +2879,8 @@ void elv_ioq_slice_expired(struct reques
 	elv_clear_iog_wait_busy_done(iog);
 	elv_clear_ioq_must_expire(ioq);
 
+	if (elv_ioq_sync(ioq))
+		reset_late_preemption(q->elevator, iog, ioq);
 	/*
 	 * Queue got expired before even a single request completed or
 	 * got expired immediately after first request completion. Use
@@ -2853,7 +2938,7 @@ void elv_slice_expired(struct request_qu
  * no or if we aren't sure, a 1 will cause a preemption attempt.
  */
 static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
-			struct request *rq)
+			struct request *rq, int group_wait_req)
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
@@ -2909,6 +2994,14 @@ static int elv_should_preempt(struct req
 	if (iog != new_iog)
 		return 0;
 
+	/*
+	 * New queue belongs to same group as active queue. If we are just
+ 	 * idling on the group (not queue), then let this new queue preempt
+ 	 * the active queue.
+ 	 */
+	if (group_wait_req)
+		return 1;
+
 	if (eq->ops->elevator_should_preempt_fn) {
 		void *sched_queue = elv_ioq_sched_queue(new_ioq);
 
@@ -2939,9 +3032,10 @@ void elv_ioq_request_add(struct request_
 	struct elv_fq_data *efqd = q->elevator->efqd;
 	struct io_queue *ioq = rq->ioq;
 	struct io_group *iog = ioq_to_io_group(ioq);
-	int group_wait = 0;
+	int group_wait_req = 0;
+	struct elevator_queue *eq = q->elevator;
 
-	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+	if (!elv_iosched_fair_queuing_enabled(eq))
 		return;
 
 	BUG_ON(!efqd);
@@ -2955,7 +3049,7 @@ void elv_ioq_request_add(struct request_
 	if (elv_iog_wait_request(iog)) {
 		del_timer(&efqd->idle_slice_timer);
 		elv_clear_iog_wait_request(iog);
-		group_wait = 1;
+		group_wait_req = 1;
 	}
 
 	/*
@@ -2970,7 +3064,7 @@ void elv_ioq_request_add(struct request_
 		return;
 	}
 
-	if (ioq == elv_active_ioq(q->elevator)) {
+	if (ioq == elv_active_ioq(eq)) {
 		/*
 		 * Remember that we saw a request from this process, but
 		 * don't start queuing just yet. Otherwise we risk seeing lots
@@ -2981,7 +3075,7 @@ void elv_ioq_request_add(struct request_
 		 * has other work pending, don't risk delaying until the
 		 * idle timer unplug to continue working.
 		 */
-		if (group_wait || elv_ioq_wait_request(ioq)) {
+		if (group_wait_req || elv_ioq_wait_request(ioq)) {
 			del_timer(&efqd->idle_slice_timer);
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
@@ -2989,7 +3083,7 @@ void elv_ioq_request_add(struct request_
 			else
 				elv_mark_ioq_must_dispatch(ioq);
 		}
-	} else if (elv_should_preempt(q, ioq, rq)) {
+	} else if (elv_should_preempt(q, ioq, rq, group_wait_req)) {
 		/*
 		 * not the active queue - expire current slice if it is
 		 * idle and has expired it's mean thinktime or this new queue
@@ -2998,13 +3092,15 @@ void elv_ioq_request_add(struct request_
 		 */
 		elv_preempt_queue(q, ioq);
 		__blk_run_queue(q);
-	} else if (group_wait) {
+	} else {
 		/*
-		 * Got a request in the group we were waiting for. Request
-		 * does not belong to active queue and we have not decided
-		 * to preempt the current active queue. Schedule the dispatch.
+		 * Request came in a queue which is not active and we did not
+		 * decide to preempt the active queue. It is possible that
+		 * active queue belonged to a different group and we did not
+		 * allow preemption. Keep a track of this event so that once
+		 * this group is ready to dispatch, we can do some more checks
 		 */
-		elv_schedule_dispatch(q);
+		set_late_preemption(eq, elv_active_ioq(eq), ioq);
 	}
 }
 
@@ -3274,10 +3370,13 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
-	if (ioq)
+	if (ioq) {
 		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
 				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
 				elv_ioq_nr_dispatched(ioq));
+		check_late_preemption(q->elevator, ioq);
+	}
+
 	return ioq;
 }

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
       [not found]   ` <20090908222827.GC3558-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-08 22:37     ` Daniel Walker
  2009-09-08 23:13     ` Fabio Checconi
  2009-09-09  4:44     ` Rik van Riel
  2 siblings, 0 replies; 322+ messages in thread
From: Daniel Walker @ 2009-09-08 22:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


Patches 25 and 26 both have checkpatch errors. Could you run them
through checkpatch and clean up any errors you find?

Daniel
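
For reference, checkpatch is run from the top of the kernel tree against the
patch files, along the lines of (the file names here are only illustrative):

	./scripts/checkpatch.pl 0025-io-controller.patch 0026-io-controller.patch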

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
       [not found]   ` <20090908222827.GC3558-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-08 22:37     ` Daniel Walker
@ 2009-09-08 23:13     ` Fabio Checconi
  2009-09-09  4:44     ` Rik van Riel
  2 siblings, 0 replies; 322+ messages in thread
From: Fabio Checconi @ 2009-09-08 23:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Tue, Sep 08, 2009 06:28:27PM -0400
>
> 
> o I found an issue during test and that is if there is a mix of queue and group
...
>  So we need to keep track of a process io queue's vdisktime, even after it got
>  deleted from io scheduler's service tree and use that same vdisktime if that
>  queue gets backlogged again. But trusting a ioq's vdisktime is bad because
>  it can lead to issues if a service tree min_vtime wrap around takes place 
>  between two requests of the queue. (Agreed that it can be not that easy to
>  hit but it is possible).
> 
>  Hence, keep a cache of io queues serviced recently and when a queue gets
>  backlogged, if it is found in cache, use that vdisktime otherwise assign
>  a new vdisktime. This cache of io queues (idle tree), is basically the idea
>  implemented by BFQ guys. I had gotten rid of idle trees in V9 and now I am
>  bringing it back. (Now I understand it better. :-)).
> 
>  There is one good side effect of keeping the cache of recently serviced io
>  queues. Now CFQ can differentiate between streaming readers and new processes
>  doing IO. Now for a new queue (which is not in the cache), we can assign a
>  lower vdisktime and for a streaming reader, we assign vdisktime based on disk
>  time used. This way small file readers or the processes doing small amount
>  of IO will have reduced latencies at the cost of little reduced throughput of
>  streaming readers.
> 

  just a little note: this patch seems to introduce a special case for
vdisktime = 0, assigning it the meaning of "bad timestamp," but the virtual
time space wraps, so 0 is a perfectly legal value, which can be reached by
service.  I have no idea if it can produce visible effects, but it doesn't
seem to be correct.
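
To make the wrap concrete, here is a stand-alone sketch (vdisktime_gt() is
copied from the patch; the values are contrived so that the timestamp lands
exactly on zero):

	#include <stdio.h>
	#include <stdint.h>

	static inline int vdisktime_gt(uint64_t a, uint64_t b)
	{
		return (int64_t)(a - b) > 0;
	}

	int main(void)
	{
		uint64_t min_vdisktime = (uint64_t)-1000;	/* service tree close to wrap */
		uint64_t vdisktime = min_vdisktime + 1000;	/* wraps around to 0 */

		/* 0 is a perfectly legal timestamp that is still in the future */
		printf("%d\n", vdisktime_gt(vdisktime, min_vdisktime));	/* prints 1 */
		return 0;
	}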


> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  block/cfq-iosched.c |    2 
>  block/elevator-fq.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  block/elevator-fq.h |    9 +
>  3 files changed, 246 insertions(+), 17 deletions(-)
> 
> Index: linux16/block/elevator-fq.c
> ===================================================================
> --- linux16.orig/block/elevator-fq.c	2009-09-08 15:44:21.000000000 -0400
> +++ linux16/block/elevator-fq.c	2009-09-08 15:47:45.000000000 -0400
> @@ -52,6 +52,8 @@ static struct kmem_cache *elv_ioq_pool;
>  #define elv_log_entity(entity, fmt, args...)
>  #endif
>  
> +static void check_idle_tree_release(struct io_service_tree *st);
> +
>  static inline struct io_queue *ioq_of(struct io_entity *entity)
>  {
>  	if (entity->my_sd == NULL)
> @@ -109,6 +111,11 @@ elv_prio_to_slice(struct elv_fq_data *ef
>  	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
>  }
>  
> +static inline int vdisktime_gt(u64 a, u64 b)
> +{
> +	return (s64)(a - b) > 0;
> +}
> +
>  static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
>  {
>  	s64 delta = (s64)(vdisktime - min_vdisktime);
> @@ -145,6 +152,7 @@ static void update_min_vdisktime(struct 
>  	}
>  
>  	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
> +	check_idle_tree_release(st);
>  }
>  
>  static inline struct io_entity *parent_entity(struct io_entity *entity)
> @@ -411,27 +419,46 @@ static void place_entity(struct io_servi
>  	struct rb_node *parent;
>  	struct io_entity *entry;
>  	int nr_active = st->nr_active - 1;
> +	struct io_queue *ioq = ioq_of(entity);
> +	int sync = 1;
> +
> +	if (ioq)
> +		sync = elv_ioq_sync(ioq);
> +
> +	if (add_front || !nr_active) {
> +		vdisktime = st->min_vdisktime;
> +		goto done;
> +	}
> +
> +	if (sync && entity->vdisktime
> +	    && vdisktime_gt(entity->vdisktime, st->min_vdisktime)) {
> +		/* vdisktime still in future. Use old vdisktime */
> +		vdisktime = entity->vdisktime;
> +		goto done;
> +	}
>  
>  	/*
> -	 * Currently put entity at the end of last entity. This probably will
> -	 * require adjustments as we move along
> +	 * Effectively a new queue. Assign sync queue a lower vdisktime so
> +	 * we can achieve better latencies for small file readers. For async
> +	 * queues, put them at the end of the existing queue.
> +	 * Group entities are always considered sync.
>  	 */
> -	if (io_entity_class_idle(entity)) {
> -		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
> -		parent = rb_last(&st->active);
> -		if (parent) {
> -			entry = rb_entry(parent, struct io_entity, rb_node);
> -			vdisktime += entry->vdisktime;
> -		}
> -	} else if (!add_front && nr_active) {
> -		parent = rb_last(&st->active);
> -		if (parent) {
> -			entry = rb_entry(parent, struct io_entity, rb_node);
> -			vdisktime = entry->vdisktime;
> -		}
> -	} else
> +	if (sync) {
>  		vdisktime = st->min_vdisktime;
> +		goto done;
> +	}
>  
> +	/*
> +	 * Put entity at the end of the tree. Effectively async queues use
> +	 * this path.
> +	 */
> +	parent = rb_last(&st->active);
> +	if (parent) {
> +		entry = rb_entry(parent, struct io_entity, rb_node);
> +		vdisktime = entry->vdisktime;
> +	} else
> +		vdisktime = st->min_vdisktime;
> +done:
>  	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
>  	elv_log_entity(entity, "place_entity: vdisktime=%llu"
>  			" min_vdisktime=%llu", entity->vdisktime,
> @@ -447,6 +474,122 @@ static inline void io_entity_update_prio
>  		 */
>  		init_io_entity_service_tree(entity, parent_entity(entity));
>  		entity->ioprio_changed = 0;
> +
> +		/*
> +		 * Assign this entity a fresh vdisktime instead of using
> +		 * previous one as prio class will lead to service tree
> +		 * change and this vdisktime will not be valid on new
> +		 * service tree.
> +		 *
> +		 * TODO: Handle the case of only prio change.
> +		 */
> +		entity->vdisktime = 0;
> +	}
> +}
> +
> +static void
> +__dequeue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> +{
> +	if (st->rb_leftmost_idle == &entity->rb_node) {
> +		struct rb_node *next_node;
> +
> +		next_node = rb_next(&entity->rb_node);
> +		st->rb_leftmost_idle = next_node;
> +	}
> +
> +	rb_erase(&entity->rb_node, &st->idle);
> +	RB_CLEAR_NODE(&entity->rb_node);
> +}
> +
> +static void dequeue_io_entity_idle(struct io_entity *entity)
> +{
> +	struct io_queue *ioq = ioq_of(entity);
> +
> +	__dequeue_io_entity_idle(entity->st, entity);
> +	entity->on_idle_st = 0;
> +	if (ioq)
> +		elv_put_ioq(ioq);
> +}
> +
> +static void
> +__enqueue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> +{
> +	struct rb_node **node = &st->idle.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct io_entity *entry;
> +	int leftmost = 1;
> +
> +	while (*node != NULL) {
> +		parent = *node;
> +		entry = rb_entry(parent, struct io_entity, rb_node);
> +
> +		if (vdisktime_gt(entry->vdisktime, entity->vdisktime))
> +			node = &parent->rb_left;
> +		else {
> +			node = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	/*
> +	 * Maintain a cache of leftmost tree entries (it is frequently
> +	 * used)
> +	 */
> +	if (leftmost)
> +		st->rb_leftmost_idle = &entity->rb_node;
> +
> +	rb_link_node(&entity->rb_node, parent, node);
> +	rb_insert_color(&entity->rb_node, &st->idle);
> +}
> +
> +static void enqueue_io_entity_idle(struct io_entity *entity)
> +{
> +	struct io_queue *ioq = ioq_of(entity);
> +	struct io_group *parent_iog;
> +
> +	/*
> +	 * Don't put an entity on idle tree if it has been marked for deletion.
> +	 * We are not expecting more io from this entity. No need to cache it
> +	 */
> +
> +	if (entity->exiting)
> +		return;
> +
> +	/*
> +	 * If parent group is exiting, don't put on idle tree. May be task got
> +	 * moved to a different cgroup and original cgroup got deleted
> +	 */
> +	parent_iog = iog_of(parent_entity(entity));
> +	if (parent_iog->entity.exiting)
> +		return;
> +
> +	if (ioq)
> +		elv_get_ioq(ioq);
> +	__enqueue_io_entity_idle(entity->st, entity);
> +	entity->on_idle_st = 1;
> +}
> +
> +static void check_idle_tree_release(struct io_service_tree *st)
> +{
> +	struct io_entity *leftmost;
> +
> +	if (!st->rb_leftmost_idle)
> +		return;
> +
> +	leftmost = rb_entry(st->rb_leftmost_idle, struct io_entity, rb_node);
> +
> +	if (vdisktime_gt(st->min_vdisktime, leftmost->vdisktime))
> +		dequeue_io_entity_idle(leftmost);
> +}
> +
> +static void flush_idle_tree(struct io_service_tree *st)
> +{
> +	struct io_entity *entity;
> +
> +	while(st->rb_leftmost_idle) {
> +		entity = rb_entry(st->rb_leftmost_idle, struct io_entity,
> +					rb_node);
> +		dequeue_io_entity_idle(entity);
>  	}
>  }
>  
> @@ -483,6 +626,9 @@ static void dequeue_io_entity(struct io_
>  	st->nr_active--;
>  	sd->nr_active--;
>  	debug_update_stats_dequeue(entity);
> +
> +	if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
> +		enqueue_io_entity_idle(entity);
>  }
>  
>  static void
> @@ -524,6 +670,16 @@ static void enqueue_io_entity(struct io_
>  	struct io_service_tree *st;
>  	struct io_sched_data *sd = io_entity_sched_data(entity);
>  
> +	if (entity->on_idle_st)
> +		dequeue_io_entity_idle(entity);
> +	else
> +		/*
> +		 * This entity was not in idle tree cache. Zero out vdisktime
> +		 * so that we don't rely on old vdisktime instead assign a
> +		 * fresh one.
> +		 */
> +		entity->vdisktime = 0;
> +
>  	io_entity_update_prio(entity);
>  	st = entity->st;
>  	st->nr_active++;
> @@ -574,6 +730,8 @@ static void requeue_io_entity(struct io_
>  	struct io_service_tree *st = entity->st;
>  	struct io_entity *next_entity;
>  
> +	entity->vdisktime = 0;
> +
>  	if (add_front) {
>  		next_entity = __lookup_next_io_entity(st);
>  
> @@ -1937,11 +2095,18 @@ static void io_free_root_group(struct el
>  {
>  	struct io_group *iog = e->efqd->root_group;
>  	struct io_cgroup *iocg = &io_root_cgroup;
> +	struct io_service_tree *st;
> +	int i;
>  
>  	spin_lock_irq(&iocg->lock);
>  	hlist_del_rcu(&iog->group_node);
>  	spin_unlock_irq(&iocg->lock);
>  
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
> +
>  	put_io_group_queues(e, iog);
>  	elv_put_iog(iog);
>  }
> @@ -2039,9 +2204,29 @@ EXPORT_SYMBOL(elv_put_iog);
>   */
>  static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
>  {
> +	struct io_service_tree *st;
> +	int i;
> +	struct io_entity *entity = &iog->entity;
> +
> +	/*
> +	 * Mark io group for deletion so that no new entry goes in
> +	 * idle tree. Any active queue which is removed from active
> +	 * tree will not be put in to idle tree.
> +	 */
> +	entity->exiting = 1;
> +
> +	/* We flush idle tree now, and don't put things in there any more. */
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
> +
>  	hlist_del(&iog->elv_data_node);
>  	put_io_group_queues(efqd->eq, iog);
>  
> +	if (entity->on_idle_st)
> +		dequeue_io_entity_idle(entity);
> +
>  	/*
>  	 * Put the reference taken at the time of creation so that when all
>  	 * queues are gone, group can be destroyed.
> @@ -2374,7 +2559,13 @@ static struct io_group *io_alloc_root_gr
>  static void io_free_root_group(struct elevator_queue *e)
>  {
>  	struct io_group *iog = e->efqd->root_group;
> +	struct io_service_tree *st;
> +	int i;
>  
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
>  	put_io_group_queues(e, iog);
>  	kfree(iog);
>  }
> @@ -3257,6 +3448,35 @@ done:
>  		elv_schedule_dispatch(q);
>  }
>  
> +/*
> + * The process associated with ioq (in case of cfq) is going away. Mark it
> + * for deletion.
> + */
> +void elv_exit_ioq(struct io_queue *ioq)
> +{
> +	struct io_entity *entity = &ioq->entity;
> +
> +	/*
> +	 * Async ioq's belong to io group and are cleaned up once group is
> +	 * being deleted. Not need to do any cleanup here even if cfq has
> +	 * dropped the reference to the queue
> +	 */
> +	if (!elv_ioq_sync(ioq))
> +		return;
> +
> +	/*
> + 	 * This queue is still under service. Just mark it so that once all
> +	 * the IO from queue is done, it is not put back in idle tree.
> +	 */
> +	if (entity->on_st) {
> +		entity->exiting = 1;
> +		return;
> +	} else if(entity->on_idle_st) {
> +		/* Remove ioq from idle tree */
> +		dequeue_io_entity_idle(entity);
> +	}
> +}
> +EXPORT_SYMBOL(elv_exit_ioq);
>  static void elv_slab_kill(void)
>  {
>  	/*
> Index: linux16/block/cfq-iosched.c
> ===================================================================
> --- linux16.orig/block/cfq-iosched.c	2009-09-08 15:43:36.000000000 -0400
> +++ linux16/block/cfq-iosched.c	2009-09-08 15:47:45.000000000 -0400
> @@ -1138,6 +1138,7 @@ static void cfq_exit_cfqq(struct cfq_dat
>  		elv_schedule_dispatch(cfqd->queue);
>  	}
>  
> +	elv_exit_ioq(cfqq->ioq);
>  	cfq_put_queue(cfqq);
>  }
>  
> @@ -1373,6 +1374,7 @@ static void changed_cgroup(struct io_con
>  		 */
>  		if (iog != __iog) {
>  			cic_set_cfqq(cic, NULL, 1);
> +			elv_exit_ioq(sync_cfqq->ioq);
>  			cfq_put_queue(sync_cfqq);
>  		}
>  	}
> Index: linux16/block/elevator-fq.h
> ===================================================================
> --- linux16.orig/block/elevator-fq.h	2009-09-08 15:43:36.000000000 -0400
> +++ linux16/block/elevator-fq.h	2009-09-08 15:47:45.000000000 -0400
> @@ -33,6 +33,10 @@ struct io_service_tree {
>  	u64 min_vdisktime;
>  	struct rb_node *rb_leftmost;
>  	unsigned int nr_active;
> +
> +        /* A cache of io entities which were served and expired */
> +        struct rb_root idle;
> +        struct rb_node *rb_leftmost_idle;
>  };
>  
>  struct io_sched_data {
> @@ -44,9 +48,12 @@ struct io_sched_data {
>  struct io_entity {
>  	struct rb_node rb_node;
>  	int on_st;
> +	int on_idle_st;
>  	u64 vdisktime;
>  	unsigned int weight;
>  	struct io_entity *parent;
> +	/* This io entity (queue or group) has been marked for deletion */
> +	unsigned int exiting;
>  
>  	struct io_sched_data *my_sd;
>  	struct io_service_tree *st;
> @@ -572,7 +579,7 @@ extern struct io_queue *elv_alloc_ioq(st
>  extern void elv_free_ioq(struct io_queue *ioq);
>  extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
>  extern int elv_iog_should_idle(struct io_queue *ioq);
> -
> +extern void elv_exit_ioq(struct io_queue *ioq);
>  #else /* CONFIG_ELV_FAIR_QUEUING */
>  static inline struct elv_fq_data *
>  elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
  2009-09-08 22:28   ` Vivek Goyal
                     ` (2 preceding siblings ...)
  (?)
@ 2009-09-08 23:13   ` Fabio Checconi
  2009-09-09  1:32       ` Vivek Goyal
       [not found]     ` <20090908231334.GJ17468-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
  -1 siblings, 2 replies; 322+ messages in thread
From: Fabio Checconi @ 2009-09-08 23:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, paolo.valente, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

Hi,

> From: Vivek Goyal <vgoyal@redhat.com>
> Date: Tue, Sep 08, 2009 06:28:27PM -0400
>
> 
> o I found an issue during test and that is if there is a mix of queue and group
...
>  So we need to keep track of process io queue's vdisktime, even it after got
>  deleted from io scheduler's service tree and use that same vdisktime if that
>  queue gets backlogged again. But trusting a ioq's vdisktime is bad because
>  it can lead to issues if a service tree min_vtime wrap around takes place 
>  between two requests of the queue. (Agreed that it can be not that easy to
>  hit but it is possible).
> 
>  Hence, keep a cache of io queues serviced recently and when a queue gets
>  backlogged, if it is found in cache, use that vdisktime otherwise assign
>  a new vdisktime. This cache of io queues (idle tree), is basically the idea
>  implemented by BFQ guys. I had gotten rid of idle trees in V9 and now I am
>  bringing it back. (Now I understand it better. :-)).
> 
>  There is one good side affect of keeping the cache of recently service io
>  queues. Now CFQ can differentiate between streaming readers and new processes
>  doing IO. Now for a new queue (which is not in the cache), we can assign a
>  lower vdisktime and for a streaming reader, we assign vdisktime based on disk
>  time used. This way small file readers or the processes doing small amount
>  of IO will have reduced latencies at the cost of little reduced throughput of
>  streaming readers.
> 

  just a little note: this patch seems to introduce a special case for
vdisktime = 0, assigning it the meaning of "bad timestamp," but the virtual
time space wraps, so 0 is a perfectly legal value, which can be reached by
service.  I have no idea if it can produce visible effects, but it doesn't
seem to be correct.


> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    2 
>  block/elevator-fq.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  block/elevator-fq.h |    9 +
>  3 files changed, 246 insertions(+), 17 deletions(-)
> 
> Index: linux16/block/elevator-fq.c
> ===================================================================
> --- linux16.orig/block/elevator-fq.c	2009-09-08 15:44:21.000000000 -0400
> +++ linux16/block/elevator-fq.c	2009-09-08 15:47:45.000000000 -0400
> @@ -52,6 +52,8 @@ static struct kmem_cache *elv_ioq_pool;
>  #define elv_log_entity(entity, fmt, args...)
>  #endif
>  
> +static void check_idle_tree_release(struct io_service_tree *st);
> +
>  static inline struct io_queue *ioq_of(struct io_entity *entity)
>  {
>  	if (entity->my_sd == NULL)
> @@ -109,6 +111,11 @@ elv_prio_to_slice(struct elv_fq_data *ef
>  	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
>  }
>  
> +static inline int vdisktime_gt(u64 a, u64 b)
> +{
> +	return (s64)(a - b) > 0;
> +}
> +
>  static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
>  {
>  	s64 delta = (s64)(vdisktime - min_vdisktime);
> @@ -145,6 +152,7 @@ static void update_min_vdisktime(struct 
>  	}
>  
>  	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
> +	check_idle_tree_release(st);
>  }
>  
>  static inline struct io_entity *parent_entity(struct io_entity *entity)
> @@ -411,27 +419,46 @@ static void place_entity(struct io_servi
>  	struct rb_node *parent;
>  	struct io_entity *entry;
>  	int nr_active = st->nr_active - 1;
> +	struct io_queue *ioq = ioq_of(entity);
> +	int sync = 1;
> +
> +	if (ioq)
> +		sync = elv_ioq_sync(ioq);
> +
> +	if (add_front || !nr_active) {
> +		vdisktime = st->min_vdisktime;
> +		goto done;
> +	}
> +
> +	if (sync && entity->vdisktime
> +	    && vdisktime_gt(entity->vdisktime, st->min_vdisktime)) {
> +		/* vdisktime still in future. Use old vdisktime */
> +		vdisktime = entity->vdisktime;
> +		goto done;
> +	}
>  
>  	/*
> -	 * Currently put entity at the end of last entity. This probably will
> -	 * require adjustments as we move along
> +	 * Effectively a new queue. Assign sync queue a lower vdisktime so
> +	 * we can achieve better latencies for small file readers. For async
> +	 * queues, put them at the end of the existing queue.
> +	 * Group entities are always considered sync.
>  	 */
> -	if (io_entity_class_idle(entity)) {
> -		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
> -		parent = rb_last(&st->active);
> -		if (parent) {
> -			entry = rb_entry(parent, struct io_entity, rb_node);
> -			vdisktime += entry->vdisktime;
> -		}
> -	} else if (!add_front && nr_active) {
> -		parent = rb_last(&st->active);
> -		if (parent) {
> -			entry = rb_entry(parent, struct io_entity, rb_node);
> -			vdisktime = entry->vdisktime;
> -		}
> -	} else
> +	if (sync) {
>  		vdisktime = st->min_vdisktime;
> +		goto done;
> +	}
>  
> +	/*
> +	 * Put entity at the end of the tree. Effectively async queues use
> +	 * this path.
> +	 */
> +	parent = rb_last(&st->active);
> +	if (parent) {
> +		entry = rb_entry(parent, struct io_entity, rb_node);
> +		vdisktime = entry->vdisktime;
> +	} else
> +		vdisktime = st->min_vdisktime;
> +done:
>  	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
>  	elv_log_entity(entity, "place_entity: vdisktime=%llu"
>  			" min_vdisktime=%llu", entity->vdisktime,
> @@ -447,6 +474,122 @@ static inline void io_entity_update_prio
>  		 */
>  		init_io_entity_service_tree(entity, parent_entity(entity));
>  		entity->ioprio_changed = 0;
> +
> +		/*
> +		 * Assign this entity a fresh vdisktime instead of using
> +		 * previous one as prio class will lead to service tree
> +		 * change and this vdisktime will not be valid on new
> +		 * service tree.
> +		 *
> +		 * TODO: Handle the case of only prio change.
> +		 */
> +		entity->vdisktime = 0;
> +	}
> +}
> +
> +static void
> +__dequeue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> +{
> +	if (st->rb_leftmost_idle == &entity->rb_node) {
> +		struct rb_node *next_node;
> +
> +		next_node = rb_next(&entity->rb_node);
> +		st->rb_leftmost_idle = next_node;
> +	}
> +
> +	rb_erase(&entity->rb_node, &st->idle);
> +	RB_CLEAR_NODE(&entity->rb_node);
> +}
> +
> +static void dequeue_io_entity_idle(struct io_entity *entity)
> +{
> +	struct io_queue *ioq = ioq_of(entity);
> +
> +	__dequeue_io_entity_idle(entity->st, entity);
> +	entity->on_idle_st = 0;
> +	if (ioq)
> +		elv_put_ioq(ioq);
> +}
> +
> +static void
> +__enqueue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> +{
> +	struct rb_node **node = &st->idle.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct io_entity *entry;
> +	int leftmost = 1;
> +
> +	while (*node != NULL) {
> +		parent = *node;
> +		entry = rb_entry(parent, struct io_entity, rb_node);
> +
> +		if (vdisktime_gt(entry->vdisktime, entity->vdisktime))
> +			node = &parent->rb_left;
> +		else {
> +			node = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	/*
> +	 * Maintain a cache of leftmost tree entries (it is frequently
> +	 * used)
> +	 */
> +	if (leftmost)
> +		st->rb_leftmost_idle = &entity->rb_node;
> +
> +	rb_link_node(&entity->rb_node, parent, node);
> +	rb_insert_color(&entity->rb_node, &st->idle);
> +}
> +
> +static void enqueue_io_entity_idle(struct io_entity *entity)
> +{
> +	struct io_queue *ioq = ioq_of(entity);
> +	struct io_group *parent_iog;
> +
> +	/*
> +	 * Don't put an entity on idle tree if it has been marked for deletion.
> +	 * We are not expecting more io from this entity. No need to cache it
> +	 */
> +
> +	if (entity->exiting)
> +		return;
> +
> +	/*
> +	 * If parent group is exiting, don't put on idle tree. May be task got
> +	 * moved to a different cgroup and original cgroup got deleted
> +	 */
> +	parent_iog = iog_of(parent_entity(entity));
> +	if (parent_iog->entity.exiting)
> +		return;
> +
> +	if (ioq)
> +		elv_get_ioq(ioq);
> +	__enqueue_io_entity_idle(entity->st, entity);
> +	entity->on_idle_st = 1;
> +}
> +
> +static void check_idle_tree_release(struct io_service_tree *st)
> +{
> +	struct io_entity *leftmost;
> +
> +	if (!st->rb_leftmost_idle)
> +		return;
> +
> +	leftmost = rb_entry(st->rb_leftmost_idle, struct io_entity, rb_node);
> +
> +	if (vdisktime_gt(st->min_vdisktime, leftmost->vdisktime))
> +		dequeue_io_entity_idle(leftmost);
> +}
> +
> +static void flush_idle_tree(struct io_service_tree *st)
> +{
> +	struct io_entity *entity;
> +
> +	while(st->rb_leftmost_idle) {
> +		entity = rb_entry(st->rb_leftmost_idle, struct io_entity,
> +					rb_node);
> +		dequeue_io_entity_idle(entity);
>  	}
>  }
>  
> @@ -483,6 +626,9 @@ static void dequeue_io_entity(struct io_
>  	st->nr_active--;
>  	sd->nr_active--;
>  	debug_update_stats_dequeue(entity);
> +
> +	if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
> +		enqueue_io_entity_idle(entity);
>  }
>  
>  static void
> @@ -524,6 +670,16 @@ static void enqueue_io_entity(struct io_
>  	struct io_service_tree *st;
>  	struct io_sched_data *sd = io_entity_sched_data(entity);
>  
> +	if (entity->on_idle_st)
> +		dequeue_io_entity_idle(entity);
> +	else
> +		/*
> +		 * This entity was not in idle tree cache. Zero out vdisktime
> +		 * so that we don't rely on old vdisktime instead assign a
> +		 * fresh one.
> +		 */
> +		entity->vdisktime = 0;
> +
>  	io_entity_update_prio(entity);
>  	st = entity->st;
>  	st->nr_active++;
> @@ -574,6 +730,8 @@ static void requeue_io_entity(struct io_
>  	struct io_service_tree *st = entity->st;
>  	struct io_entity *next_entity;
>  
> +	entity->vdisktime = 0;
> +
>  	if (add_front) {
>  		next_entity = __lookup_next_io_entity(st);
>  
> @@ -1937,11 +2095,18 @@ static void io_free_root_group(struct el
>  {
>  	struct io_group *iog = e->efqd->root_group;
>  	struct io_cgroup *iocg = &io_root_cgroup;
> +	struct io_service_tree *st;
> +	int i;
>  
>  	spin_lock_irq(&iocg->lock);
>  	hlist_del_rcu(&iog->group_node);
>  	spin_unlock_irq(&iocg->lock);
>  
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
> +
>  	put_io_group_queues(e, iog);
>  	elv_put_iog(iog);
>  }
> @@ -2039,9 +2204,29 @@ EXPORT_SYMBOL(elv_put_iog);
>   */
>  static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
>  {
> +	struct io_service_tree *st;
> +	int i;
> +	struct io_entity *entity = &iog->entity;
> +
> +	/*
> +	 * Mark io group for deletion so that no new entry goes in
> +	 * idle tree. Any active queue which is removed from active
> +	 * tree will not be put in to idle tree.
> +	 */
> +	entity->exiting = 1;
> +
> +	/* We flush idle tree now, and don't put things in there any more. */
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
> +
>  	hlist_del(&iog->elv_data_node);
>  	put_io_group_queues(efqd->eq, iog);
>  
> +	if (entity->on_idle_st)
> +		dequeue_io_entity_idle(entity);
> +
>  	/*
>  	 * Put the reference taken at the time of creation so that when all
>  	 * queues are gone, group can be destroyed.
> @@ -2374,7 +2559,13 @@ static struct io_group *io_alloc_root_gr
>  static void io_free_root_group(struct elevator_queue *e)
>  {
>  	struct io_group *iog = e->efqd->root_group;
> +	struct io_service_tree *st;
> +	int i;
>  
> +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> +		st = iog->sched_data.service_tree + i;
> +		flush_idle_tree(st);
> +	}
>  	put_io_group_queues(e, iog);
>  	kfree(iog);
>  }
> @@ -3257,6 +3448,35 @@ done:
>  		elv_schedule_dispatch(q);
>  }
>  
> +/*
> + * The process associated with the ioq (in the case of cfq) is going away.
> + * Mark it for deletion.
> + */
> +void elv_exit_ioq(struct io_queue *ioq)
> +{
> +	struct io_entity *entity = &ioq->entity;
> +
> +	/*
> +	 * Async ioqs belong to the io group and are cleaned up once the
> +	 * group is deleted. No need to do any cleanup here even if cfq has
> +	 * dropped its reference to the queue.
> +	 */
> +	if (!elv_ioq_sync(ioq))
> +		return;
> +
> +	/*
> +	 * This queue is still under service. Just mark it so that once all
> +	 * the IO from the queue is done, it is not put back on the idle tree.
> +	 */
> +	if (entity->on_st) {
> +		entity->exiting = 1;
> +		return;
> +	} else if(entity->on_idle_st) {
> +		/* Remove ioq from idle tree */
> +		dequeue_io_entity_idle(entity);
> +	}
> +}
> +EXPORT_SYMBOL(elv_exit_ioq);
>  static void elv_slab_kill(void)
>  {
>  	/*
> Index: linux16/block/cfq-iosched.c
> ===================================================================
> --- linux16.orig/block/cfq-iosched.c	2009-09-08 15:43:36.000000000 -0400
> +++ linux16/block/cfq-iosched.c	2009-09-08 15:47:45.000000000 -0400
> @@ -1138,6 +1138,7 @@ static void cfq_exit_cfqq(struct cfq_dat
>  		elv_schedule_dispatch(cfqd->queue);
>  	}
>  
> +	elv_exit_ioq(cfqq->ioq);
>  	cfq_put_queue(cfqq);
>  }
>  
> @@ -1373,6 +1374,7 @@ static void changed_cgroup(struct io_con
>  		 */
>  		if (iog != __iog) {
>  			cic_set_cfqq(cic, NULL, 1);
> +			elv_exit_ioq(sync_cfqq->ioq);
>  			cfq_put_queue(sync_cfqq);
>  		}
>  	}
> Index: linux16/block/elevator-fq.h
> ===================================================================
> --- linux16.orig/block/elevator-fq.h	2009-09-08 15:43:36.000000000 -0400
> +++ linux16/block/elevator-fq.h	2009-09-08 15:47:45.000000000 -0400
> @@ -33,6 +33,10 @@ struct io_service_tree {
>  	u64 min_vdisktime;
>  	struct rb_node *rb_leftmost;
>  	unsigned int nr_active;
> +
> +        /* A cache of io entities which were served and expired */
> +        struct rb_root idle;
> +        struct rb_node *rb_leftmost_idle;
>  };
>  
>  struct io_sched_data {
> @@ -44,9 +48,12 @@ struct io_sched_data {
>  struct io_entity {
>  	struct rb_node rb_node;
>  	int on_st;
> +	int on_idle_st;
>  	u64 vdisktime;
>  	unsigned int weight;
>  	struct io_entity *parent;
> +	/* This io entity (queue or group) has been marked for deletion */
> +	unsigned int exiting;
>  
>  	struct io_sched_data *my_sd;
>  	struct io_service_tree *st;
> @@ -572,7 +579,7 @@ extern struct io_queue *elv_alloc_ioq(st
>  extern void elv_free_ioq(struct io_queue *ioq);
>  extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
>  extern int elv_iog_should_idle(struct io_queue *ioq);
> -
> +extern void elv_exit_ioq(struct io_queue *ioq);
>  #else /* CONFIG_ELV_FAIR_QUEUING */
>  static inline struct elv_fq_data *
>  elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
  2009-09-08 22:37   ` Daniel Walker
@ 2009-09-09  1:09     ` Vivek Goyal
  2009-09-09  1:09       ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-09  1:09 UTC (permalink / raw)
  To: Daniel Walker
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Sep 08, 2009 at 03:37:07PM -0700, Daniel Walker wrote:
> 
> Patches 25 and 26 both have checkpatch errors. Could you run them
> through checkpatch and clean up any errors you find?
> 

Sure. Will clean up those errors in the V10 posting.
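
For reference, the usual way to catch these before posting is to run each patch
through the in-tree script, e.g. (the file name below is just an example):

  $ ./scripts/checkpatch.pl 0025-io-controller-fix-queue-vs-group-fairness.patch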

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
       [not found]     ` <20090908231334.GJ17468-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
@ 2009-09-09  1:32       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-09  1:32 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 09, 2009 at 01:13:34AM +0200, Fabio Checconi wrote:
> Hi,
> 
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Tue, Sep 08, 2009 06:28:27PM -0400
> >
> > 
> > o I found an issue during test and that is if there is a mix of queue and group
> ...
> >  So we need to keep track of a process io queue's vdisktime even after it
> >  got deleted from the io scheduler's service tree, and use that same
> >  vdisktime if that queue gets backlogged again. But trusting an ioq's
> >  vdisktime is bad because it can lead to issues if a service tree
> >  min_vtime wraparound takes place between two requests of the queue.
> >  (Agreed that it may not be that easy to hit, but it is possible.)
> > 
> >  Hence, keep a cache of io queues serviced recently and when a queue gets
> >  backlogged, if it is found in cache, use that vdisktime otherwise assign
> >  a new vdisktime. This cache of io queues (idle tree), is basically the idea
> >  implemented by BFQ guys. I had gotten rid of idle trees in V9 and now I am
> >  bringing it back. (Now I understand it better. :-)).
> > 
> >  There is one good side effect of keeping the cache of recently serviced
> >  io queues. Now CFQ can differentiate between streaming readers and new
> >  processes doing IO. For a new queue (which is not in the cache), we can
> >  assign a lower vdisktime, and for a streaming reader, we assign a
> >  vdisktime based on the disk time used. This way small file readers or
> >  processes doing small amounts of IO will see reduced latencies at the
> >  cost of slightly reduced throughput for streaming readers.
> > 
> 
>   just a little note: this patch seems to introduce a special case for
> vdisktime = 0, assigning it the meaning of "bad timestamp," but the virtual
> time space wraps, so 0 is a perfectly legal value, which can be reached by
> service.  I have no idea if it can produce visible effects, but it doesn't
> seem to be correct.
> 
> 

Hi Fabio,

You are right that technically, during wraparounds, one can hit the value 0 as
a legal value. But I think it is hard to hit, and at the same time the only
side effect will be that a queue is either placed favorably (in the case of
sync queues) or at the end of the tree (if it is an async queue).

Async queues go to the end after every dispatch round anyway. So the only side
effect is that once per wraparound cycle a sync queue will be placed favorably
and can gain extra share in one dispatch round.
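
To make that corner case concrete, here is a tiny stand-alone sketch of the
comparison involved (illustrative only, not code from the patch):

/* Illustrative only: mimics the wraparound-safe comparison in the patch. */
#include <stdio.h>
#include <stdint.h>

static inline int vdisktime_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;	/* signed delta handles wraparound */
}

int main(void)
{
	uint64_t min_vdisktime = UINT64_MAX - 500;	/* tree about to wrap */
	uint64_t cached = 0;				/* legal timestamp after wrap */

	/* 0 still compares as being ahead of min_vdisktime here... */
	printf("in future: %d\n", vdisktime_gt(cached, min_vdisktime));

	/*
	 * ...but an "if (entity->vdisktime)" style check would read the very
	 * same value as "no cached timestamp" and hand out a fresh one.
	 */
	return 0;
}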

I think it is not a big issue at this point. But if it becomes significant, I
can introduce a new variable or pass a function parameter to denote whether we
found the queue in the cache or not.
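
For what it's worth, the "new variable" approach could look roughly like the
sketch below (the field and helper names are made up, not from the patch):

#include <stdint.h>

/*
 * Sketch only: carry an explicit "found in the idle-tree cache" flag instead
 * of overloading vdisktime == 0 as "no usable timestamp".
 */
struct entity_sketch {
	uint64_t vdisktime;
	unsigned int vdisktime_cached:1;
};

static inline void cache_vdisktime(struct entity_sketch *e, uint64_t vt)
{
	e->vdisktime = vt;		/* remember the service position */
	e->vdisktime_cached = 1;	/* and mark it as trustworthy */
}

static inline void drop_cached_vdisktime(struct entity_sketch *e)
{
	e->vdisktime_cached = 0;	/* stale value must not be reused */
}

int main(void)
{
	struct entity_sketch e = { 0 };

	cache_vdisktime(&e, 12345);
	/* ...queue gets backlogged, cached vdisktime found valid, reuse it... */
	drop_cached_vdisktime(&e);
	return 0;
}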

But if you think it is an absolute no-no, let me know.

Thanks
Vivek


> > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  block/cfq-iosched.c |    2 
> >  block/elevator-fq.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  block/elevator-fq.h |    9 +
> >  3 files changed, 246 insertions(+), 17 deletions(-)
> > 
> > Index: linux16/block/elevator-fq.c
> > ===================================================================
> > --- linux16.orig/block/elevator-fq.c	2009-09-08 15:44:21.000000000 -0400
> > +++ linux16/block/elevator-fq.c	2009-09-08 15:47:45.000000000 -0400
> > @@ -52,6 +52,8 @@ static struct kmem_cache *elv_ioq_pool;
> >  #define elv_log_entity(entity, fmt, args...)
> >  #endif
> >  
> > +static void check_idle_tree_release(struct io_service_tree *st);
> > +
> >  static inline struct io_queue *ioq_of(struct io_entity *entity)
> >  {
> >  	if (entity->my_sd == NULL)
> > @@ -109,6 +111,11 @@ elv_prio_to_slice(struct elv_fq_data *ef
> >  	return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
> >  }
> >  
> > +static inline int vdisktime_gt(u64 a, u64 b)
> > +{
> > +	return (s64)(a - b) > 0;
> > +}
> > +
> >  static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
> >  {
> >  	s64 delta = (s64)(vdisktime - min_vdisktime);
> > @@ -145,6 +152,7 @@ static void update_min_vdisktime(struct 
> >  	}
> >  
> >  	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
> > +	check_idle_tree_release(st);
> >  }
> >  
> >  static inline struct io_entity *parent_entity(struct io_entity *entity)
> > @@ -411,27 +419,46 @@ static void place_entity(struct io_servi
> >  	struct rb_node *parent;
> >  	struct io_entity *entry;
> >  	int nr_active = st->nr_active - 1;
> > +	struct io_queue *ioq = ioq_of(entity);
> > +	int sync = 1;
> > +
> > +	if (ioq)
> > +		sync = elv_ioq_sync(ioq);
> > +
> > +	if (add_front || !nr_active) {
> > +		vdisktime = st->min_vdisktime;
> > +		goto done;
> > +	}
> > +
> > +	if (sync && entity->vdisktime
> > +	    && vdisktime_gt(entity->vdisktime, st->min_vdisktime)) {
> > +		/* vdisktime still in future. Use old vdisktime */
> > +		vdisktime = entity->vdisktime;
> > +		goto done;
> > +	}
> >  
> >  	/*
> > -	 * Currently put entity at the end of last entity. This probably will
> > -	 * require adjustments as we move along
> > +	 * Effectively a new queue. Assign sync queue a lower vdisktime so
> > +	 * we can achieve better latencies for small file readers. For async
> > +	 * queues, put them at the end of the existing queue.
> > +	 * Group entities are always considered sync.
> >  	 */
> > -	if (io_entity_class_idle(entity)) {
> > -		vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
> > -		parent = rb_last(&st->active);
> > -		if (parent) {
> > -			entry = rb_entry(parent, struct io_entity, rb_node);
> > -			vdisktime += entry->vdisktime;
> > -		}
> > -	} else if (!add_front && nr_active) {
> > -		parent = rb_last(&st->active);
> > -		if (parent) {
> > -			entry = rb_entry(parent, struct io_entity, rb_node);
> > -			vdisktime = entry->vdisktime;
> > -		}
> > -	} else
> > +	if (sync) {
> >  		vdisktime = st->min_vdisktime;
> > +		goto done;
> > +	}
> >  
> > +	/*
> > +	 * Put entity at the end of the tree. Effectively async queues use
> > +	 * this path.
> > +	 */
> > +	parent = rb_last(&st->active);
> > +	if (parent) {
> > +		entry = rb_entry(parent, struct io_entity, rb_node);
> > +		vdisktime = entry->vdisktime;
> > +	} else
> > +		vdisktime = st->min_vdisktime;
> > +done:
> >  	entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
> >  	elv_log_entity(entity, "place_entity: vdisktime=%llu"
> >  			" min_vdisktime=%llu", entity->vdisktime,
> > @@ -447,6 +474,122 @@ static inline void io_entity_update_prio
> >  		 */
> >  		init_io_entity_service_tree(entity, parent_entity(entity));
> >  		entity->ioprio_changed = 0;
> > +
> > +		/*
> > +		 * Assign this entity a fresh vdisktime instead of using
> > +		 * previous one as prio class will lead to service tree
> > +		 * change and this vdisktime will not be valid on new
> > +		 * service tree.
> > +		 *
> > +		 * TODO: Handle the case of only prio change.
> > +		 */
> > +		entity->vdisktime = 0;
> > +	}
> > +}
> > +
> > +static void
> > +__dequeue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> > +{
> > +	if (st->rb_leftmost_idle == &entity->rb_node) {
> > +		struct rb_node *next_node;
> > +
> > +		next_node = rb_next(&entity->rb_node);
> > +		st->rb_leftmost_idle = next_node;
> > +	}
> > +
> > +	rb_erase(&entity->rb_node, &st->idle);
> > +	RB_CLEAR_NODE(&entity->rb_node);
> > +}
> > +
> > +static void dequeue_io_entity_idle(struct io_entity *entity)
> > +{
> > +	struct io_queue *ioq = ioq_of(entity);
> > +
> > +	__dequeue_io_entity_idle(entity->st, entity);
> > +	entity->on_idle_st = 0;
> > +	if (ioq)
> > +		elv_put_ioq(ioq);
> > +}
> > +
> > +static void
> > +__enqueue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
> > +{
> > +	struct rb_node **node = &st->idle.rb_node;
> > +	struct rb_node *parent = NULL;
> > +	struct io_entity *entry;
> > +	int leftmost = 1;
> > +
> > +	while (*node != NULL) {
> > +		parent = *node;
> > +		entry = rb_entry(parent, struct io_entity, rb_node);
> > +
> > +		if (vdisktime_gt(entry->vdisktime, entity->vdisktime))
> > +			node = &parent->rb_left;
> > +		else {
> > +			node = &parent->rb_right;
> > +			leftmost = 0;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Maintain a cache of leftmost tree entries (it is frequently
> > +	 * used)
> > +	 */
> > +	if (leftmost)
> > +		st->rb_leftmost_idle = &entity->rb_node;
> > +
> > +	rb_link_node(&entity->rb_node, parent, node);
> > +	rb_insert_color(&entity->rb_node, &st->idle);
> > +}
> > +
> > +static void enqueue_io_entity_idle(struct io_entity *entity)
> > +{
> > +	struct io_queue *ioq = ioq_of(entity);
> > +	struct io_group *parent_iog;
> > +
> > +	/*
> > +	 * Don't put an entity on idle tree if it has been marked for deletion.
> > +	 * We are not expecting more io from this entity. No need to cache it
> > +	 */
> > +
> > +	if (entity->exiting)
> > +		return;
> > +
> > +	/*
> > +	 * If the parent group is exiting, don't put on the idle tree. Maybe the
> > +	 * task got moved to a different cgroup and the original cgroup got deleted.
> > +	 */
> > +	parent_iog = iog_of(parent_entity(entity));
> > +	if (parent_iog->entity.exiting)
> > +		return;
> > +
> > +	if (ioq)
> > +		elv_get_ioq(ioq);
> > +	__enqueue_io_entity_idle(entity->st, entity);
> > +	entity->on_idle_st = 1;
> > +}
> > +
> > +static void check_idle_tree_release(struct io_service_tree *st)
> > +{
> > +	struct io_entity *leftmost;
> > +
> > +	if (!st->rb_leftmost_idle)
> > +		return;
> > +
> > +	leftmost = rb_entry(st->rb_leftmost_idle, struct io_entity, rb_node);
> > +
> > +	if (vdisktime_gt(st->min_vdisktime, leftmost->vdisktime))
> > +		dequeue_io_entity_idle(leftmost);
> > +}
> > +
> > +static void flush_idle_tree(struct io_service_tree *st)
> > +{
> > +	struct io_entity *entity;
> > +
> > +	while(st->rb_leftmost_idle) {
> > +		entity = rb_entry(st->rb_leftmost_idle, struct io_entity,
> > +					rb_node);
> > +		dequeue_io_entity_idle(entity);
> >  	}
> >  }
> >  
> > @@ -483,6 +626,9 @@ static void dequeue_io_entity(struct io_
> >  	st->nr_active--;
> >  	sd->nr_active--;
> >  	debug_update_stats_dequeue(entity);
> > +
> > +	if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
> > +		enqueue_io_entity_idle(entity);
> >  }
> >  
> >  static void
> > @@ -524,6 +670,16 @@ static void enqueue_io_entity(struct io_
> >  	struct io_service_tree *st;
> >  	struct io_sched_data *sd = io_entity_sched_data(entity);
> >  
> > +	if (entity->on_idle_st)
> > +		dequeue_io_entity_idle(entity);
> > +	else
> > +		/*
> > +		 * This entity was not in the idle tree cache. Zero out
> > +		 * vdisktime so that we don't rely on the old value; a fresh
> > +		 * one will be assigned instead.
> > +		 */
> > +		entity->vdisktime = 0;
> > +
> >  	io_entity_update_prio(entity);
> >  	st = entity->st;
> >  	st->nr_active++;
> > @@ -574,6 +730,8 @@ static void requeue_io_entity(struct io_
> >  	struct io_service_tree *st = entity->st;
> >  	struct io_entity *next_entity;
> >  
> > +	entity->vdisktime = 0;
> > +
> >  	if (add_front) {
> >  		next_entity = __lookup_next_io_entity(st);
> >  
> > @@ -1937,11 +2095,18 @@ static void io_free_root_group(struct el
> >  {
> >  	struct io_group *iog = e->efqd->root_group;
> >  	struct io_cgroup *iocg = &io_root_cgroup;
> > +	struct io_service_tree *st;
> > +	int i;
> >  
> >  	spin_lock_irq(&iocg->lock);
> >  	hlist_del_rcu(&iog->group_node);
> >  	spin_unlock_irq(&iocg->lock);
> >  
> > +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> > +		st = iog->sched_data.service_tree + i;
> > +		flush_idle_tree(st);
> > +	}
> > +
> >  	put_io_group_queues(e, iog);
> >  	elv_put_iog(iog);
> >  }
> > @@ -2039,9 +2204,29 @@ EXPORT_SYMBOL(elv_put_iog);
> >   */
> >  static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
> >  {
> > +	struct io_service_tree *st;
> > +	int i;
> > +	struct io_entity *entity = &iog->entity;
> > +
> > +	/*
> > +	 * Mark the io group for deletion so that no new entry goes onto
> > +	 * the idle tree. Any active queue which is removed from the active
> > +	 * tree will not be put onto the idle tree.
> > +	 */
> > +	entity->exiting = 1;
> > +
> > +	/* We flush idle tree now, and don't put things in there any more. */
> > +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> > +		st = iog->sched_data.service_tree + i;
> > +		flush_idle_tree(st);
> > +	}
> > +
> >  	hlist_del(&iog->elv_data_node);
> >  	put_io_group_queues(efqd->eq, iog);
> >  
> > +	if (entity->on_idle_st)
> > +		dequeue_io_entity_idle(entity);
> > +
> >  	/*
> >  	 * Put the reference taken at the time of creation so that when all
> >  	 * queues are gone, group can be destroyed.
> > @@ -2374,7 +2559,13 @@ static struct io_group *io_alloc_root_gr
> >  static void io_free_root_group(struct elevator_queue *e)
> >  {
> >  	struct io_group *iog = e->efqd->root_group;
> > +	struct io_service_tree *st;
> > +	int i;
> >  
> > +	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
> > +		st = iog->sched_data.service_tree + i;
> > +		flush_idle_tree(st);
> > +	}
> >  	put_io_group_queues(e, iog);
> >  	kfree(iog);
> >  }
> > @@ -3257,6 +3448,35 @@ done:
> >  		elv_schedule_dispatch(q);
> >  }
> >  
> > +/*
> > + * The process associated with the ioq (in the case of cfq) is going away.
> > + * Mark it for deletion.
> > + */
> > +void elv_exit_ioq(struct io_queue *ioq)
> > +{
> > +	struct io_entity *entity = &ioq->entity;
> > +
> > +	/*
> > +	 * Async ioqs belong to the io group and are cleaned up once the
> > +	 * group is deleted. No need to do any cleanup here even if cfq has
> > +	 * dropped its reference to the queue.
> > +	 */
> > +	if (!elv_ioq_sync(ioq))
> > +		return;
> > +
> > +	/*
> > +	 * This queue is still under service. Just mark it so that once all
> > +	 * the IO from the queue is done, it is not put back on the idle tree.
> > +	 */
> > +	if (entity->on_st) {
> > +		entity->exiting = 1;
> > +		return;
> > +	} else if(entity->on_idle_st) {
> > +		/* Remove ioq from idle tree */
> > +		dequeue_io_entity_idle(entity);
> > +	}
> > +}
> > +EXPORT_SYMBOL(elv_exit_ioq);
> >  static void elv_slab_kill(void)
> >  {
> >  	/*
> > Index: linux16/block/cfq-iosched.c
> > ===================================================================
> > --- linux16.orig/block/cfq-iosched.c	2009-09-08 15:43:36.000000000 -0400
> > +++ linux16/block/cfq-iosched.c	2009-09-08 15:47:45.000000000 -0400
> > @@ -1138,6 +1138,7 @@ static void cfq_exit_cfqq(struct cfq_dat
> >  		elv_schedule_dispatch(cfqd->queue);
> >  	}
> >  
> > +	elv_exit_ioq(cfqq->ioq);
> >  	cfq_put_queue(cfqq);
> >  }
> >  
> > @@ -1373,6 +1374,7 @@ static void changed_cgroup(struct io_con
> >  		 */
> >  		if (iog != __iog) {
> >  			cic_set_cfqq(cic, NULL, 1);
> > +			elv_exit_ioq(sync_cfqq->ioq);
> >  			cfq_put_queue(sync_cfqq);
> >  		}
> >  	}
> > Index: linux16/block/elevator-fq.h
> > ===================================================================
> > --- linux16.orig/block/elevator-fq.h	2009-09-08 15:43:36.000000000 -0400
> > +++ linux16/block/elevator-fq.h	2009-09-08 15:47:45.000000000 -0400
> > @@ -33,6 +33,10 @@ struct io_service_tree {
> >  	u64 min_vdisktime;
> >  	struct rb_node *rb_leftmost;
> >  	unsigned int nr_active;
> > +
> > +        /* A cache of io entities which were served and expired */
> > +        struct rb_root idle;
> > +        struct rb_node *rb_leftmost_idle;
> >  };
> >  
> >  struct io_sched_data {
> > @@ -44,9 +48,12 @@ struct io_sched_data {
> >  struct io_entity {
> >  	struct rb_node rb_node;
> >  	int on_st;
> > +	int on_idle_st;
> >  	u64 vdisktime;
> >  	unsigned int weight;
> >  	struct io_entity *parent;
> > +	/* This io entity (queue or group) has been marked for deletion */
> > +	unsigned int exiting;
> >  
> >  	struct io_sched_data *my_sd;
> >  	struct io_service_tree *st;
> > @@ -572,7 +579,7 @@ extern struct io_queue *elv_alloc_ioq(st
> >  extern void elv_free_ioq(struct io_queue *ioq);
> >  extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
> >  extern int elv_iog_should_idle(struct io_queue *ioq);
> > -
> > +extern void elv_exit_ioq(struct io_queue *ioq);
> >  #else /* CONFIG_ELV_FAIR_QUEUING */
> >  static inline struct elv_fq_data *
> >  elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
  2009-09-09  1:32       ` Vivek Goyal
  (?)
  (?)
@ 2009-09-09  2:03       ` Fabio Checconi
  -1 siblings, 0 replies; 322+ messages in thread
From: Fabio Checconi @ 2009-09-09  2:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, paolo.valente, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

> From: Vivek Goyal <vgoyal@redhat.com>
> Date: Tue, Sep 08, 2009 09:32:05PM -0400
>
> On Wed, Sep 09, 2009 at 01:13:34AM +0200, Fabio Checconi wrote:
> > Hi,
> > 
> > > From: Vivek Goyal <vgoyal@redhat.com>
> > > Date: Tue, Sep 08, 2009 06:28:27PM -0400
> > >
> > > 
> > > o I found an issue during test and that is if there is a mix of queue and group
> > ...
> > >  So we need to keep track of a process io queue's vdisktime even after it got
> > >  deleted from the io scheduler's service tree, and use that same vdisktime if
> > >  that queue gets backlogged again. But trusting an ioq's own vdisktime is bad
> > >  because it can lead to issues if a service tree min_vtime wraparound takes
> > >  place between two requests of the queue. (Agreed that it may not be that easy
> > >  to hit, but it is possible.)
> > > 
> > >  Hence, keep a cache of recently serviced io queues, and when a queue gets
> > >  backlogged, if it is found in the cache, use that vdisktime; otherwise assign
> > >  a new vdisktime. This cache of io queues (the idle tree) is basically the idea
> > >  implemented by the BFQ guys. I had gotten rid of idle trees in V9 and now I am
> > >  bringing them back. (Now I understand it better. :-))
> > > 
> > >  There is one good side effect of keeping the cache of recently serviced io
> > >  queues: now CFQ can differentiate between streaming readers and new processes
> > >  doing IO. For a new queue (which is not in the cache), we can assign a
> > >  lower vdisktime, and for a streaming reader, we assign a vdisktime based on
> > >  disk time used. This way small-file readers, or processes doing small amounts
> > >  of IO, will have reduced latencies at the cost of slightly reduced throughput
> > >  for streaming readers.
> > > 
> > 
> >   just a little note: this patch seems to introduce a special case for
> > vdisktime = 0, assigning it the meaning of "bad timestamp," but the virtual
> > time space wraps, so 0 is a perfectly legal value, which can be reached by
> > service.  I have no idea if it can produce visible effects, but it doesn't
> > seem to be correct.
> > 
> > 
> 
> Hi Fabio,
> 
> You are right that technically, during wraparounds, one can hit the value 0 as
> a legal value. But I think it is hard to hit; at the same time, the only side
> effect of it will be that a queue gets either placed favorably (in case of a
> sync queue) or at the end of the tree (if it is an async queue).
> 
> Async queues anyway go to the end after every dispatch round. So the only side
> effect is that, once during a wraparound cycle, a sync queue will be placed
> favorably and can gain share once in a dispatch round.
> 
> I think it is not a big issue at this point. But if it becomes
> significant, I can introduce a new variable or start passing a function
> parameter to denote whether we found the queue in the cache or not.
> 
> But if you think it is an absolute no-no, let me know...
> 

I don't think it's an issue at all, just wanted to make sure it gets
noticed, because timestamping bugs may be hard to hit but often are
hard to debug.  Maybe it deserves a line of comment...
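
For reference, a minimal sketch of the alternative mentioned above, i.e.
passing the cache-hit information explicitly instead of overloading
vdisktime == 0 (which is a legal, wrapped value). The function name and the
placement details are assumptions for illustration, not code from the series.

/* Sketch: carry "was this queue found on the idle tree?" explicitly. */
static void place_entity_sketch(struct io_service_tree *st,
				struct io_entity *entity, int cached)
{
	if (cached) {
		/* reuse the vdisktime remembered on the idle tree; a real
		 * implementation would clamp it against st->min_vdisktime
		 * with a wrap-safe comparison */
		return;
	}

	/* brand-new queue: start at the current minimum so it is served
	 * with low latency */
	entity->vdisktime = st->min_vdisktime;
}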

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 24/23] io-controller: Don't leave a queue active when a disk is idle
  2009-09-08 22:28 ` [PATCH 24/23] io-controller: Don't leave a queue active when a disk is idle Vivek Goyal
       [not found]   ` <20090908222821.GB3558-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-09  3:39   ` Rik van Riel
  1 sibling, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-09-09  3:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o It is possible that when there is only a single queue in the system, it
>   remains unexpired for a long time (because there is no IO activity on the
>   disk). So when the next request comes in after a long time, it might make
>   the scheduler think that the queue was using the disk all this while, and it
>   will assign a high vdisktime to the queue. Hence make sure the queue is
>   expired once all the requests from the queue have completed.
> 
> o Also avoid unnecessarily expiring a queue when it has one request
>   dispatched, is waiting for it to finish, and does not have more requests
>   queued to dispatch.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.
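
To make the first point above concrete, a rough sketch of the completion-path
check being described; elv_ioq_nr_dispatched() is mentioned elsewhere in the
series, but the other names here (elv_ioq_nr_queued(), elv_ioq_slice_expired())
and the exact placement are assumptions for illustration, not the patch's
actual code.

/* Sketch: expire a lone queue as soon as it has nothing queued and nothing
 * in flight, so idle wall-clock time is not later billed to it as slice
 * used, which would inflate its vdisktime. */
static void check_expire_on_completion_sketch(struct request_queue *q,
					      struct io_queue *ioq)
{
	if (!elv_ioq_nr_queued(ioq) && !elv_ioq_nr_dispatched(ioq))
		elv_ioq_slice_expired(q);
}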

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 25/23] io-controller: fix queue vs group fairness
  2009-09-08 22:28   ` Vivek Goyal
@ 2009-09-09  4:44     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-09-09  4:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o I found an issue during testing, and that is: if there is a mix of queues and
>   groups at the same level, there can be a fairness issue. For example, consider
>   the following case.

Acked-by: Rik van Riel <riel@redhat.com>

(assuming the cleanups are done in V10 - the artifact of
  a wraparound hitting exactly 0 should not be an issue)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 26/23] io-controller: fix writer preemption with in a group
  2009-09-08 22:28   ` Vivek Goyal
@ 2009-09-09  4:59     ` Rik van Riel
  -1 siblings, 0 replies; 322+ messages in thread
From: Rik van Riel @ 2009-09-09  4:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo

Vivek Goyal wrote:
> o Found another issue during testing. Consider following hierarchy.
> 
> 			root
> 			/ \
> 		       R1  G1
> 			  /\
> 			 R2 W
> 
>   Generally in CFQ, when readers and writers are running, a reader immediately
>   preempts writers and hence the reader gets the better bandwidth. In a
>   hierarchical setup it becomes a little more tricky. In the above diagram, G1
>   is a group, R1 and R2 are readers, and W is a writer task.
> 
>   Now assume W runs, then R1 runs and then R2 runs. After R2 has used its
>   time slice, if R1 is scheduled in, then after a couple of ms R2 (a streaming
>   reader) will get backlogged again in group G1. But it will not preempt R1,
>   as R1 is also a reader, and also because preemption across groups is not
>   allowed for isolation reasons. Hence R2 will get backlogged in G1 and will
>   get a vdisktime much higher than W. So when G1 gets scheduled again, W will
>   get to run its full slice length despite the fact that R2 is queued on the
>   same service tree.
> 
>   The core issue here is that apart from regular preemptions (preemption
>   across classes), CFQ also has this special notion of preemption within a
>   class, and that can lead to issues when the active task is running in a
>   different group than the one where the new queue gets backlogged.
> 
>   To solve the issue, keep track of this event (I am calling it late
>   preemption). When a group becomes eligible to run again, if late_preemption
>   is set, check if there are sync readers backlogged, and if yes, expire the
>   writer after one round of dispatch.
> 
>   This solves the issue of reader not getting enough bandwidth in hierarchical
>   setups.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Conceptually a nice solution.  The code gets a little tricky,
but I guess any code dealing with these situations would end
up that way :)

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.
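
For reference, a hedged sketch of the late-preemption check described in the
changelog above; the field and helper names (late_preemption,
group_has_backlogged_sync_queue()) are illustrative assumptions rather than the
series' actual code, and elv_ioq_sync() is the only helper taken from the
posted patches.

/* Sketch: when a group is selected to run again, decide whether its active
 * async writer should be expired after one dispatch round because a sync
 * reader got backlogged in the meantime. */
static int writer_should_yield_sketch(struct io_group *iog,
				      struct io_queue *active_ioq)
{
	if (!iog->late_preemption)
		return 0;

	if (elv_ioq_sync(active_ioq))
		return 0;	/* active queue is already a reader */

	if (!group_has_backlogged_sync_queue(iog))
		return 0;	/* no reader waiting in this group */

	iog->late_preemption = 0;
	return 1;		/* caller expires the writer after this round */
}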

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-08 19:19     ` Vivek Goyal
@ 2009-09-09  7:38       ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-09  7:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> I happened to encounter a bug when I tested IO Controller V9.
>> When there are three tasks running concurrently in three groups,
>> that is, one in a parent group and the other two tasks running
>> in two different child groups respectively, reading or writing
>> files on some disk, say disk "hdb", a task may hang up, and
>> other tasks which access "hdb" will also hang up.
>>
>> The bug only happens when using the AS io scheduler.
>> The following script can reproduce this bug on my box.
>>
> 
> Hi Gui,
> 
> I tried reproducing this on my system and can't reproduce it. All the
> three processes get killed and system does not hang.
> 
> Can you please dig deeper a bit into it. 
> 
> - If whole system hangs or it is just IO to disk seems to be hung.

    Only a task that is trying to do IO to the disk will hang up.

> - Does io scheduler switch on the device work

    Yes, the io scheduler can be switched, and the hung task is then resumed.

> - If the system is not hung, can you capture the blktrace on the device.
>   Trace might give some idea, what's happening.

I ran a "find" task to do some io on that disk; it seems the task hangs
while it is issuing the getdents() syscall.
The kernel generates the following message:

INFO: task find:3260 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
find          D a1e95787  1912  3260   2897 0x00000004
 f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
 00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
 0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
Call Trace:
 [<c0447323>] ? getnstimeofday+0x57/0xe0
 [<c04438df>] ? ktime_get_ts+0x4a/0x4e
 [<c068ab68>] io_schedule+0x47/0x79
 [<c04c12ee>] sync_buffer+0x36/0x3a
 [<c068ae14>] __wait_on_bit+0x36/0x5d
 [<c04c12b8>] ? sync_buffer+0x0/0x3a
 [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
 [<c04c12b8>] ? sync_buffer+0x0/0x3a
 [<c0440fa4>] ? wake_bit_function+0x0/0x43
 [<c04c1249>] __wait_on_buffer+0x19/0x1c
 [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
 [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
 [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
 [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
 [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
 [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
 [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
 [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
 [<c04b1100>] ? filldir64+0x0/0xcd
 [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
 [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
 [<c04b12db>] ? vfs_readdir+0x46/0x94
 [<c04b12fd>] vfs_readdir+0x68/0x94
 [<c04b1100>] ? filldir64+0x0/0xcd
 [<c04b1387>] sys_getdents64+0x5e/0x9f
 [<c04028b4>] sysenter_do_call+0x12/0x32
1 lock held by find/3260:
 #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94

ext3 calls wait_on_buffer() to wait on the buffer and schedules the task out in
TASK_UNINTERRUPTIBLE state, and I found this task is only resumed after quite a
long period (more than 10 minutes).

-- 
Regards
Gui Jianfeng



^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-08 19:19     ` Vivek Goyal
@ 2009-09-09  9:41       ` Jens Axboe
  -1 siblings, 0 replies; 322+ messages in thread
From: Jens Axboe @ 2009-09-09  9:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Tue, Sep 08 2009, Vivek Goyal wrote:
> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> > Hi Vivek,
> > 
> > I happened to encount a bug when i test IO Controller V9.
> > When there are three tasks to run concurrently in three group,
> > that is, one is parent group, and other two tasks are running 
> > in two different child groups respectively to read or write 
> > files in some disk, say disk "hdb", The task may hang up, and 
> > other tasks which access into "hdb" will also hang up.
> > 
> > The bug only happens when using AS io scheduler.
> > The following scirpt can reproduce this bug in my box.
> > 
> 
> Hi Gui,
> 
> I tried reproducing this on my system and can't reproduce it. All the
> three processes get killed and system does not hang.

The key factor is likely the old IDE driver, since this is IO scheduler
related. It probably works if Gui uses libata instead, and you can
probably reproduce it if you use drivers/ide/ on a disk.

So the likely culprit is a missing queue restart somewhere on
IO completion.

-- 
Jens Axboe
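
For context, the usual shape of the "queue restart on completion" pattern being
referred to; elv_schedule_dispatch() appears in the posted patches, but the
surrounding sketch (and the busy_queues field) is only an illustration of the
idea, not the actual fix.

/* Sketch: at request-completion time, if some queue/group still has work
 * pending, kick the dispatcher so the elevator is asked for the next
 * request. Missing this on one of the completion paths leaves the queue
 * idle with pending IO, which matches the hang seen in this thread. */
static void completion_restart_sketch(struct request_queue *q,
				      struct elv_fq_data *efqd)
{
	if (efqd->busy_queues)
		elv_schedule_dispatch(q);
}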


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-09  7:38       ` Gui Jianfeng
@ 2009-09-09 15:05         ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-09 15:05 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> >> Hi Vivek,
> >>
> >> I happened to encount a bug when i test IO Controller V9.
> >> When there are three tasks to run concurrently in three group,
> >> that is, one is parent group, and other two tasks are running 
> >> in two different child groups respectively to read or write 
> >> files in some disk, say disk "hdb", The task may hang up, and 
> >> other tasks which access into "hdb" will also hang up.
> >>
> >> The bug only happens when using AS io scheduler.
> >> The following scirpt can reproduce this bug in my box.
> >>
> > 
> > Hi Gui,
> > 
> > I tried reproducing this on my system and can't reproduce it. All the
> > three processes get killed and system does not hang.
> > 
> > Can you please dig deeper a bit into it. 
> > 
> > - If whole system hangs or it is just IO to disk seems to be hung.
> 
>     Only when the task is trying do IO to disk it will hang up.
> 
> > - Does io scheduler switch on the device work
> 
>     yes, io scheduler can be switched, and the hung task will be resumed.
> 
> > - If the system is not hung, can you capture the blktrace on the device.
> >   Trace might give some idea, what's happening.
> 
> I run a "find" task to do some io on that disk, it seems that task hangs 
> when it is issuing getdents() syscall.
> kernel generates the following message:
> 
> INFO: task find:3260 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> find          D a1e95787  1912  3260   2897 0x00000004
>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
> Call Trace:
>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>  [<c068ab68>] io_schedule+0x47/0x79
>  [<c04c12ee>] sync_buffer+0x36/0x3a
>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>  [<c04b1100>] ? filldir64+0x0/0xcd
>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>  [<c04b12fd>] vfs_readdir+0x68/0x94
>  [<c04b1100>] ? filldir64+0x0/0xcd
>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>  [<c04028b4>] sysenter_do_call+0x12/0x32
> 1 lock held by find/3260:
>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
> 
> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
> state, and I found this task will be resumed after a quite long period(more than 10 mins).

Thanks Gui. As Jens said, it does look like a case of a missing queue
restart somewhere, and now we are stuck: no requests are being dispatched
to the disk and the queue is already unplugged.

Can you please also try capturing the trace of events at io scheduler
(blktrace) to see how we got into that situation.

Are you using the ide drivers and not libata? As Jens said, I will try to make
use of the ide drivers and see if I can reproduce it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-09 15:05         ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-09 15:05 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> >> Hi Vivek,
> >>
> >> I happened to encount a bug when i test IO Controller V9.
> >> When there are three tasks to run concurrently in three group,
> >> that is, one is parent group, and other two tasks are running 
> >> in two different child groups respectively to read or write 
> >> files in some disk, say disk "hdb", The task may hang up, and 
> >> other tasks which access into "hdb" will also hang up.
> >>
> >> The bug only happens when using AS io scheduler.
> >> The following scirpt can reproduce this bug in my box.
> >>
> > 
> > Hi Gui,
> > 
> > I tried reproducing this on my system and can't reproduce it. All the
> > three processes get killed and system does not hang.
> > 
> > Can you please dig deeper a bit into it. 
> > 
> > - If whole system hangs or it is just IO to disk seems to be hung.
> 
>     Only when the task is trying do IO to disk it will hang up.
> 
> > - Does io scheduler switch on the device work
> 
>     yes, io scheduler can be switched, and the hung task will be resumed.
> 
> > - If the system is not hung, can you capture the blktrace on the device.
> >   Trace might give some idea, what's happening.
> 
> I run a "find" task to do some io on that disk, it seems that task hangs 
> when it is issuing getdents() syscall.
> kernel generates the following message:
> 
> INFO: task find:3260 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> find          D a1e95787  1912  3260   2897 0x00000004
>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
> Call Trace:
>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>  [<c068ab68>] io_schedule+0x47/0x79
>  [<c04c12ee>] sync_buffer+0x36/0x3a
>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>  [<c04b1100>] ? filldir64+0x0/0xcd
>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>  [<c04b12fd>] vfs_readdir+0x68/0x94
>  [<c04b1100>] ? filldir64+0x0/0xcd
>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>  [<c04028b4>] sysenter_do_call+0x12/0x32
> 1 lock held by find/3260:
>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
> 
> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
> state, and I found this task will be resumed after a quite long period(more than 10 mins).

Thanks Gui. As Jens said, it does look like a case of missing queue
restart somewhere and now we are stuck, no requests are being dispatched
to the disk and queue is already unplugged.

Can you please also try capturing the trace of events at io scheduler
(blktrace) to see how did we get into that situation.

Are you using ide drivers and not libata? As jens said, I will try to make
use of ide drivers and see if I can reproduce it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]         ` <20090909150537.GD8256-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  3:20           ` Gui Jianfeng
  2009-09-11  1:15           ` [PATCH] io-controller: Fix task hanging when there are more than one groups Gui Jianfeng
  1 sibling, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-10  3:20 UTC (permalink / raw)
  To: Vivek Goyal, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>> Hi Vivek,
>>>>
>>>> I happened to encount a bug when i test IO Controller V9.
>>>> When there are three tasks to run concurrently in three group,
>>>> that is, one is parent group, and other two tasks are running 
>>>> in two different child groups respectively to read or write 
>>>> files in some disk, say disk "hdb", The task may hang up, and 
>>>> other tasks which access into "hdb" will also hang up.
>>>>
>>>> The bug only happens when using AS io scheduler.
>>>> The following scirpt can reproduce this bug in my box.
>>>>
>>> Hi Gui,
>>>
>>> I tried reproducing this on my system and can't reproduce it. All the
>>> three processes get killed and system does not hang.
>>>
>>> Can you please dig deeper a bit into it. 
>>>
>>> - If whole system hangs or it is just IO to disk seems to be hung.
>>     Only when the task is trying do IO to disk it will hang up.
>>
>>> - Does io scheduler switch on the device work
>>     yes, io scheduler can be switched, and the hung task will be resumed.
>>
>>> - If the system is not hung, can you capture the blktrace on the device.
>>>   Trace might give some idea, what's happening.
>> I run a "find" task to do some io on that disk, it seems that task hangs 
>> when it is issuing getdents() syscall.
>> kernel generates the following message:
>>
>> INFO: task find:3260 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> find          D a1e95787  1912  3260   2897 0x00000004
>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>> Call Trace:
>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>  [<c068ab68>] io_schedule+0x47/0x79
>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>> 1 lock held by find/3260:
>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>
>> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
>> state, and I found this task will be resumed after a quite long period(more than 10 mins).
> 
> Thanks Gui. As Jens said, it does look like a case of missing queue
> restart somewhere and now we are stuck, no requests are being dispatched
> to the disk and queue is already unplugged.
> 
> Can you please also try capturing the trace of events at io scheduler
> (blktrace) to see how did we get into that situation.

  Ok, I'll try.
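
  (A capture along the following lines should be enough, I think; the device name, duration
  and output prefix are just placeholders.)

  # run the reproducer script in another shell, then trace the disk for 60 seconds
  blktrace -d /dev/hdb -o hang-trace -w 60
  # decode the per-cpu binary traces into a readable log
  blkparse -i hang-trace > hang-trace.txt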

> 
> Are you using ide drivers and not libata? As jens said, I will try to make
> use of ide drivers and see if I can reproduce it.

  Hi Vivek, Jens,

  Yes, I was using the old IDE driver. So I switched to libata instead (deactivated
  the whole “ATA/ATAPI/MFM/RLL support”), but the bug still exists and I can
  reproduce it with the same script.

> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found] ` <1251495072-7780-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (29 preceding siblings ...)
  2009-09-08 22:28   ` [PATCH 26/23] io-controller: fix writer preemption with in a group Vivek Goyal
@ 2009-09-10 15:18   ` Jerome Marchand
  30 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-10 15:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
 
Hi Vivek,

I've run some postgresql benchmarks for io-controller. Tests were made
with a 2.6.31-rc6 kernel, without the io-controller patches (when
relevant) and with the io-controller v8 and v9 patches.
I set up two instances of the TPC-H database, each running in its
own io-cgroup. I ran one client against each database and issued on
each this simple request:
$ select count(*) from LINEITEM;
where LINEITEM is the biggest table of TPC-H (6001215 entries,
720MB). That request generates a steady stream of IOs.

Time is measured by psql (\timing switched on). Each test is run twice,
or more if there is any significant difference between the first two
runs. Before each run, the cache is flushed:
$ echo 3 > /proc/sys/vm/drop_caches
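
(For reference, one run roughly follows the sketch below; the cgroup mount point, group names,
weight file name and database names are assumptions about my local setup rather than part of the
patches, and the shell's "time" here just stands in for psql's \timing.)

# mount the io controller hierarchy and create one group per database instance
mount -t cgroup -o io none /cgroup
mkdir /cgroup/db1 /cgroup/db2
echo 1000 > /cgroup/db1/io.weight            # weight file name assumed
echo 1000 > /cgroup/db2/io.weight
echo $POSTGRES1_PID > /cgroup/db1/tasks      # postmaster pids are placeholders
echo $POSTGRES2_PID > /cgroup/db2/tasks

# drop the page cache, then time the same query against both instances
echo 3 > /proc/sys/vm/drop_caches
( time psql -d tpch1 -c 'select count(*) from LINEITEM;' ) &
( time psql -d tpch2 -c 'select count(*) from LINEITEM;' ) &
wait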


Results with 2 groups of same io policy (BE) and same io weight (1000):

	w/o io-scheduler	io-scheduler v8		io-scheduler v9
	first	second		first	second		first	second
	DB	DB		DB	DB		DB	DB

CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s

As you can see, there is no significant difference for the CFQ
scheduler. There is a big improvement for the noop and deadline
schedulers (why is that happening?). The performance with the
anticipatory scheduler is a bit lower (~4%).


Results with 2 groups of same io policy (BE), different io weights and
CFQ scheduler:
			io-scheduler v8		io-scheduler v9
weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
weights = 1000, 250	29.2s	45.8s		29.2s	45.6s

The result in terms of fairness is close to what we can expect from the
ideal theoretical case: with io weights of 1000 and 500 (1000 and 250),
the first request gets 2/3 (4/5) of the io time as long as it runs and thus
finishes in about 3/4 (5/8) of the total time.


Results with 2 groups of different io policies, same io weight and
CFQ scheduler:
			io-scheduler v8		io-scheduler v9
policy = RT, BE		22.5s	45.3s		22.4s	45.0s
policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s

Here again, the result in terms of fairness is very close to what we
expect.

Thanks,
Jerome

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 15/23] io-controller: noop changes for hierarchical fair queuing
       [not found]     ` <4A9B3B0B.9090009-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 17:32       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-10 17:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Aug 30, 2009 at 10:52:59PM -0400, Rik van Riel wrote:
> Vivek Goyal wrote:
>> This patch changes noop to use queue scheduling code from elevator layer.
>> One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
>>
>> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> Not sure why noop needs hierarchical fair queueing
> support, but this patch is so small we might as well
> take it to keep things consistent between schedulers.
>

Thinking more about it, it probably can be useful for the case ryo is
pointing out, where drivers for fast SSDs are not making use of the kernel IO
schedulers.

If they want group IO scheduling support on these SSDs, they can modify their
driver to make use of hierarchical noop.
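
(For devices that do go through the block layer, picking noop stays the usual sysfs switch, as
below with sdb as a placeholder; the drivers in question bypass this path entirely, which is why
they would have to hook into the hierarchical noop code themselves.)

cat /sys/block/sdb/queue/scheduler           # e.g. "noop anticipatory deadline [cfq]"
echo noop > /sys/block/sdb/queue/scheduler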

> Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Thanks
Vivek

>
> -- 
> All rights reversed.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 07/23] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]     ` <4A9F3319.8040509-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-09-10 20:11       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-10 20:11 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 02, 2009 at 11:08:09PM -0400, Munehiro Ikeda wrote:
> Hi,
> 
> Vivek Goyal wrote, on 08/28/2009 05:30 PM:
> > +static struct io_group *io_find_alloc_group(struct request_queue *q,
> > +			struct cgroup *cgroup, struct elv_fq_data *efqd,
> > +			int create)
> > +{
> > +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> > +	struct io_group *iog = NULL;
> > +	/* Note: Use efqd as key */
> > +	void *key = efqd;
> > +
> > +	/*
> > +	 * Take a refenrece to css object. Don't want to map a bio to
> > +	 * a group if it has been marked for deletion
> > +	 */
> > +
> > +	if (!css_tryget(&iocg->css))
> > +		return iog;
> 
> cgroup_to_io_cgroup() returns NULL if only blkio subsystem
> is mounted but io subsystem is not.  It can cause NULL pointer
> access.
> 

Good catch Muuhh.  Thanks. Applied for next release.

Thanks
Vivek

> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
> ---
>  block/elevator-fq.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index b723c12..6714e73 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1827,7 +1827,7 @@ static struct io_group *io_find_alloc_group(struct request_queue *q,
>          * a group if it has been marked for deletion
>          */
>  
> -       if (!css_tryget(&iocg->css))
> +       if (!iocg || !css_tryget(&iocg->css))
>                 return iog;
>  
>         iog = io_cgroup_lookup_group(iocg, key);
> -- 
> 1.6.2.5
> 
> 
> -- 
> IKEDA, Munehiro
>   NEC Corporation of America
>     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]   ` <4AA918C1.6070907-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 20:52     ` Vivek Goyal
  2009-09-13 18:54     ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-10 20:52 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>  
> Hi Vivek,
> 
> I've run some postgresql benchmarks for io-controller. Tests have been
> made with 2.6.31-rc6 kernel, without io-controller patches (when
> relevant) and with io-controller v8 and v9 patches.
> I set up two instances of the TPC-H database, each running in their
> own io-cgroup. I ran two clients to these databases and tested on each
> that simple request:
> $ select count(*) from LINEITEM;
> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> 720MB). That request generates a steady stream of IOs.
> 
> Time is measure by psql (\timing switched on). Each test is run twice
> or more if there is any significant difference between the first two
> runs. Before each run, the cache is flush:
> $ echo 3 > /proc/sys/vm/drop_caches
> 
> 
> Results with 2 groups of same io policy (BE) and same io weight (1000):
> 
> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> 	first	second		first	second		first	second
> 	DB	DB		DB	DB		DB	DB
> 
> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> 
> As you can see, there is no significant difference for CFQ
> scheduler.

Thanks Jerome.  

> There is big improvement for noop and deadline schedulers
> (why is that happening?).

I think it is because now the related IO is in a single queue and it gets to run
for 100ms or so (like CFQ). So previously, IO from both the instances
would go into a single queue, which should lead to more seeks as requests
from the two groups get kind of interleaved.

With the io controller, both groups have separate queues, so requests from
the two database instances do not get interleaved. (This almost
becomes like CFQ, where there are separate queues for each io context
and, for a sequential reader, one io context gets to run nicely for a certain
number of ms based on its priority.)

> The performance with anticipatory scheduler
> is a bit lower (~4%).
> 

I will run some tests with AS and see if I can reproduce this lower
performance and attribute it to a particular piece of code.

> 
> Results with 2 groups of same io policy (BE), different io weights and
> CFQ scheduler:
> 			io-scheduler v8		io-scheduler v9
> weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
> weigths = 1000, 250	29.2s	45.8s		29.2s	45.6s
> 
> The result in term of fairness is close to what we can expect from the
> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
> the first request get 2/3 (4/5) of io time as long as it runs and thus
> finish in about 3/4 (5/8) of total time. 
> 

Jerome, after 36.6 seconds the disk will be fully given to the second group.
Hence these times might not accurately reflect who got how much of the
disk time.

Can you capture the output of the "io.disk_time" file in both cgroups
at the time the task in the higher-weight group completes? Alternatively,
you can just run a script in a loop which prints the output of
"cat io.disk_time | grep major:minor" every 2 seconds. That way we can
see how disk time is being distributed between the groups.
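
(Something as simple as the loop below is all I have in mind; the cgroup paths and the
major:minor number are placeholders for whatever your setup uses.)

# print the accumulated disk time of both groups every 2 seconds
while true; do
        grep "8:16" /cgroup/db1/io.disk_time /cgroup/db2/io.disk_time
        sleep 2
done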

> 
> Results  with 2 groups of different io policies, same io weight and
> CFQ scheduler:
> 			io-scheduler v8		io-scheduler v9
> policy = RT, BE		22.5s	45.3s		22.4s	45.0s
> policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s
> 
> Here again, the result in term of fairness is very close from what we
> expect.

Same as above in this case too.

These seem to be good tests for fairness measurement in the case of streaming
readers. One more interesting test case would be to see what the random read
latencies look like when multiple streaming readers are present.

So if we can launch 4-5 dd processes in one group and then issue some small
random queries to postgresql in a second group, I am keen to see how quickly
those queries complete with and without the io controller. It would be
interesting to see the results for all 4 io schedulers.
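
(A rough sketch of the kind of setup I mean; the group names, file paths and the query itself
are only placeholders, not something I have actually run.)

# put this shell into the streaming-reader group and start 4 sequential readers
echo $$ > /cgroup/readers/tasks
for i in 1 2 3 4; do
        dd if=/data/file$i of=/dev/null bs=1M &
done

# the dd processes stay in the readers group; now move the shell to the second
# group, drop caches and time one small random query against postgresql
echo $$ > /cgroup/db/tasks
echo 3 > /proc/sys/vm/drop_caches
time psql -d tpch -c "select count(*) from LINEITEM where l_orderkey = 12345;"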

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]     ` <20090910205227.GB3617-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 20:56       ` Vivek Goyal
  2009-09-14 14:26         ` Jerome Marchand
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-10 20:56 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > Hi All,
> > > 
> > > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >  
> > Hi Vivek,
> > 
> > I've run some postgresql benchmarks for io-controller. Tests have been
> > made with 2.6.31-rc6 kernel, without io-controller patches (when
> > relevant) and with io-controller v8 and v9 patches.
> > I set up two instances of the TPC-H database, each running in their
> > own io-cgroup. I ran two clients to these databases and tested on each
> > that simple request:
> > $ select count(*) from LINEITEM;
> > where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > 720MB). That request generates a steady stream of IOs.
> > 
> > Time is measure by psql (\timing switched on). Each test is run twice
> > or more if there is any significant difference between the first two
> > runs. Before each run, the cache is flush:
> > $ echo 3 > /proc/sys/vm/drop_caches
> > 
> > 
> > Results with 2 groups of same io policy (BE) and same io weight (1000):
> > 
> > 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> > 	first	second		first	second		first	second
> > 	DB	DB		DB	DB		DB	DB
> > 
> > CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> > Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> > AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> > Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> > 
> > As you can see, there is no significant difference for CFQ
> > scheduler.
> 
> Thanks Jerome.  
> 
> > There is big improvement for noop and deadline schedulers
> > (why is that happening?).
> 
> I think because now related IO is in a single queue and it gets to run
> for 100ms or so (like CFQ). So previously, IO from both the instances
> will go into a single queue which should lead to more seeks as requests
> from two groups will kind of get interleaved.
> 
> With io controller, both groups have separate queues so requests from
> both the data based instances will not get interleaved (This almost
> becomes like CFQ where ther are separate queues for each io context
> and for sequential reader, one io context gets to run nicely for certain
> ms based on its priority).
> 
> > The performance with anticipatory scheduler
> > is a bit lower (~4%).
> > 

Hi Jerome,

Can you also run the AS test with the io controller patches but with both
databases in the root group (basically, don't put them into separate groups)?
I suspect this regression might come from the fact that we now have to switch
between queues, and in AS we wait for the request from the previous queue to
finish before the next queue is scheduled in; that is probably slowing
things down a bit... just a wild guess.
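
(I.e. something along these lines before rerunning the AS test, with the cgroup paths as
placeholders, so that the only difference from your earlier run is the absence of child groups.)

# move every task of both database groups back to the root group
for pid in $(cat /cgroup/db1/tasks /cgroup/db2/tasks); do
        echo $pid > /cgroup/tasks
done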

Thanks
Vivek

> 
> I will run some tests with AS and see if I can reproduce this lower
> performance and attribute it to a particular piece of code.
> 
> > 
> > Results with 2 groups of same io policy (BE), different io weights and
> > CFQ scheduler:
> > 			io-scheduler v8		io-scheduler v9
> > weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
> > weigths = 1000, 250	29.2s	45.8s		29.2s	45.6s
> > 
> > The result in term of fairness is close to what we can expect from the
> > ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
> > the first request get 2/3 (4/5) of io time as long as it runs and thus
> > finish in about 3/4 (5/8) of total time. 
> > 
> 
> Jerome, after 36.6 seconds, disk will be fully given to second group.
> Hence these times might not reflect the accurate measure of who got how
> much of disk time.
> 
> Can you just capture the output of "io.disk_time" file in both the cgroups
> at the time of completion of task in higher weight group. Alternatively,
> you can just run this a script in a loop which prints the output of
>  "cat io.disk_time | grep major:minor" every  2 seconds. That way we can
> see how disk times are being distributed between groups.
> 
> > 
> > Results  with 2 groups of different io policies, same io weight and
> > CFQ scheduler:
> > 			io-scheduler v8		io-scheduler v9
> > policy = RT, BE		22.5s	45.3s		22.4s	45.0s
> > policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s
> > 
> > Here again, the result in term of fairness is very close from what we
> > expect.
> 
> Same as above in this case too.
> 
> These seem to be good test for fairness measurement in case of streaming 
> readers. I think one more interesting test case will be do how are the 
> random read latencies in case of multiple streaming readers present.
> 
> So if we can launch 4-5 dd processes in one group and then issue some
> random small queueries on postgresql in second group, I am keen to see
> how quickly the query can be completed with and without io controller.
> Would be interesting to see at results for all 4 io schedulers.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]         ` <20090909150537.GD8256-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-10  3:20           ` Gui Jianfeng
@ 2009-09-11  1:15           ` Gui Jianfeng
  1 sibling, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-11  1:15 UTC (permalink / raw)
  To: Vivek Goyal, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>> Hi Vivek,
>>>>
>>>> I happened to encount a bug when i test IO Controller V9.
>>>> When there are three tasks to run concurrently in three group,
>>>> that is, one is parent group, and other two tasks are running 
>>>> in two different child groups respectively to read or write 
>>>> files in some disk, say disk "hdb", The task may hang up, and 
>>>> other tasks which access into "hdb" will also hang up.
>>>>
>>>> The bug only happens when using AS io scheduler.
>>>> The following scirpt can reproduce this bug in my box.
>>>>
>>> Hi Gui,
>>>
>>> I tried reproducing this on my system and can't reproduce it. All the
>>> three processes get killed and system does not hang.
>>>
>>> Can you please dig deeper a bit into it. 
>>>
>>> - If whole system hangs or it is just IO to disk seems to be hung.
>>     Only when the task is trying do IO to disk it will hang up.
>>
>>> - Does io scheduler switch on the device work
>>     yes, io scheduler can be switched, and the hung task will be resumed.
>>
>>> - If the system is not hung, can you capture the blktrace on the device.
>>>   Trace might give some idea, what's happening.
>> I run a "find" task to do some io on that disk, it seems that task hangs 
>> when it is issuing getdents() syscall.
>> kernel generates the following message:
>>
>> INFO: task find:3260 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> find          D a1e95787  1912  3260   2897 0x00000004
>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>> Call Trace:
>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>  [<c068ab68>] io_schedule+0x47/0x79
>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>> 1 lock held by find/3260:
>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>
>> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
>> state, and I found this task will be resumed after a quite long period(more than 10 mins).
> 
> Thanks Gui. As Jens said, it does look like a case of missing queue
> restart somewhere and now we are stuck, no requests are being dispatched
> to the disk and queue is already unplugged.
> 
> Can you please also try capturing the trace of events at io scheduler
> (blktrace) to see how did we get into that situation.
> 
> Are you using ide drivers and not libata? As jens said, I will try to make
> use of ide drivers and see if I can reproduce it.
> 

Hi Vivek, Jens,

Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
optimize by not expiring the current ioq, on the assumption that the current ioq belongs to the root
group. But in some cases this assumption is not true. Consider the following scenario: there is a
child cgroup under the root cgroup, task A is running in that child cgroup, and task A issues some
IOs. Then we kill task A and remove the child cgroup; at this point only the root cgroup is left,
but the ioq is still under service, and from now on this ioq will never expire because of the "only
root" optimization.
The following patch makes sure the active ioq really does belong to the root group before the "only
root group" optimization is applied.
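
A minimal sequence along these lines should recreate that situation (just a sketch; the cgroup
mount point /cgroup and disk hdb are assumptions, adjust them to your setup and make sure the
AS scheduler is selected on the disk):

$ mkdir /cgroup/child                               # child group under the root cgroup
$ dd if=/dev/hdb of=/dev/null bs=1M count=1024 &    # task A, issuing IO
$ echo $! > /cgroup/child/tasks                     # move task A into the child group
$ sleep 5                                           # let its ioq become the active one
$ kill -9 %1; wait                                  # kill task A
$ rmdir /cgroup/child                               # only the root cgroup is left now
$ time cat /some/file/on/hdb > /dev/null            # further IO to hdb may now stall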

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b723c12..3f86552 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
 	}
 }
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
-	if (list_empty(&io_root_cgroup.css.cgroup->children))
+	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
+	    efqd->busy_queues == 1 && efqd->root_group->ioq)
 		return 1;
 
 	return 0;
@@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
 	return 1;
 }
@@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
 	struct elevator_queue *e = q->elevator;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
 	int ret = 1;
-
+	
 	if (e->ops->elevator_expire_ioq_fn) {
 		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
 							slice_expired, force);
@@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	 * single queue ioschedulers (noop, deadline, AS).
 	 */
 
-	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
 		goto keep_queue;
 
 	/* We are waiting for this group to become busy before it expires.*/
@@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * unnecessary overhead.
 		 */
 
-		if (is_only_root_group() &&
+		if (is_only_root_group(ioq->efqd) &&
 			elv_iosched_single_ioq(q->elevator)) {
 			elv_log_ioq(efqd, ioq, "select: only root group,"
 					" no expiry");
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-09 15:05         ` Vivek Goyal
                           ` (2 preceding siblings ...)
  (?)
@ 2009-09-11  1:15         ` Gui Jianfeng
  2009-09-14  2:44             ` Vivek Goyal
                             ` (2 more replies)
  -1 siblings, 3 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-11  1:15 UTC (permalink / raw)
  To: Vivek Goyal, jens.axboe
  Cc: linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, s-uchida, taka, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>> Hi Vivek,
>>>>
>>>> I happened to encount a bug when i test IO Controller V9.
>>>> When there are three tasks to run concurrently in three group,
>>>> that is, one is parent group, and other two tasks are running 
>>>> in two different child groups respectively to read or write 
>>>> files in some disk, say disk "hdb", The task may hang up, and 
>>>> other tasks which access into "hdb" will also hang up.
>>>>
>>>> The bug only happens when using AS io scheduler.
>>>> The following scirpt can reproduce this bug in my box.
>>>>
>>> Hi Gui,
>>>
>>> I tried reproducing this on my system and can't reproduce it. All the
>>> three processes get killed and system does not hang.
>>>
>>> Can you please dig deeper a bit into it. 
>>>
>>> - If whole system hangs or it is just IO to disk seems to be hung.
>>     Only when the task is trying do IO to disk it will hang up.
>>
>>> - Does io scheduler switch on the device work
>>     yes, io scheduler can be switched, and the hung task will be resumed.
>>
>>> - If the system is not hung, can you capture the blktrace on the device.
>>>   Trace might give some idea, what's happening.
>> I run a "find" task to do some io on that disk, it seems that task hangs 
>> when it is issuing getdents() syscall.
>> kernel generates the following message:
>>
>> INFO: task find:3260 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> find          D a1e95787  1912  3260   2897 0x00000004
>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>> Call Trace:
>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>  [<c068ab68>] io_schedule+0x47/0x79
>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>> 1 lock held by find/3260:
>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>
>> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
>> state, and I found this task will be resumed after a quite long period(more than 10 mins).
> 
> Thanks Gui. As Jens said, it does look like a case of missing queue
> restart somewhere and now we are stuck, no requests are being dispatched
> to the disk and queue is already unplugged.
> 
> Can you please also try capturing the trace of events at io scheduler
> (blktrace) to see how did we get into that situation.
> 
> Are you using ide drivers and not libata? As jens said, I will try to make
> use of ide drivers and see if I can reproduce it.
> 

Hi Vivek, Jens,

Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
optimize by not expiring the current ioq, on the assumption that the current ioq belongs to the root
group. But in some cases this assumption is not true. Consider the following scenario: there is a
child cgroup under the root cgroup, task A is running in that child cgroup, and task A issues some
IOs. Then we kill task A and remove the child cgroup; at this point only the root cgroup is left,
but the ioq is still under service, and from now on this ioq will never expire because of the "only
root" optimization.
The following patch makes sure the active ioq really does belong to the root group before the "only
root group" optimization is applied.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/elevator-fq.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b723c12..3f86552 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
 	}
 }
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
-	if (list_empty(&io_root_cgroup.css.cgroup->children))
+	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
+	    efqd->busy_queues == 1 && efqd->root_group->ioq)
 		return 1;
 
 	return 0;
@@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
 int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
 EXPORT_SYMBOL(elv_iog_should_idle);
 
-static inline int is_only_root_group(void)
+static inline int is_only_root_group(struct elv_fq_data *efqd)
 {
 	return 1;
 }
@@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
 	struct elevator_queue *e = q->elevator;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
 	int ret = 1;
-
+	
 	if (e->ops->elevator_expire_ioq_fn) {
 		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
 							slice_expired, force);
@@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
 	 * single queue ioschedulers (noop, deadline, AS).
 	 */
 
-	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
 		goto keep_queue;
 
 	/* We are waiting for this group to become busy before it expires.*/
@@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * unnecessary overhead.
 		 */
 
-		if (is_only_root_group() &&
+		if (is_only_root_group(ioq->efqd) &&
 			elv_iosched_single_ioq(q->elevator)) {
 			elv_log_ioq(efqd, ioq, "select: only root group,"
 					" no expiry");
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]       ` <20090910205657.GD3617-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11 13:16         ` Jerome Marchand
  0 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 13:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>  
>>> Hi Vivek,
>>>
>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>> relevant) and with io-controller v8 and v9 patches.
>>> I set up two instances of the TPC-H database, each running in their
>>> own io-cgroup. I ran two clients to these databases and tested on each
>>> that simple request:
>>> $ select count(*) from LINEITEM;
>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>> 720MB). That request generates a steady stream of IOs.
>>>
>>> Time is measure by psql (\timing switched on). Each test is run twice
>>> or more if there is any significant difference between the first two
>>> runs. Before each run, the cache is flush:
>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>
>>>
>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>
>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>> 	first	second		first	second		first	second
>>> 	DB	DB		DB	DB		DB	DB
>>>
>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>
>>> As you can see, there is no significant difference for CFQ
>>> scheduler.
>> Thanks Jerome.  
>>
>>> There is big improvement for noop and deadline schedulers
>>> (why is that happening?).
>> I think because now related IO is in a single queue and it gets to run
>> for 100ms or so (like CFQ). So previously, IO from both the instances
>> will go into a single queue which should lead to more seeks as requests
>> from two groups will kind of get interleaved.
>>
>> With io controller, both groups have separate queues so requests from
>> both the data based instances will not get interleaved (This almost
>> becomes like CFQ where ther are separate queues for each io context
>> and for sequential reader, one io context gets to run nicely for certain
>> ms based on its priority).
>>
>>> The performance with anticipatory scheduler
>>> is a bit lower (~4%).
>>>
> 
> Hi Jerome, 
> 
> Can you also run the AS test with io controller patches and both the
> database in root group (basically don't put them in to separate group). I 
> suspect that this regression might come from that fact that we now have
> to switch between queues and in AS we wait for request to finish from
> previous queue before next queue is scheduled in and probably that is
> slowing down things a bit.., just a wild guess..
> 

Hi Vivek,

I guess that's not the reason. I got 46.6s for both DBs in the root group with
the io-controller v9 patches. I also reran the test with the DBs in different
groups and got about the same results as above (48.3s and 48.6s).

Jerome



> Thanks
> Vivek
> 
>> I will run some tests with AS and see if I can reproduce this lower
>> performance and attribute it to a particular piece of code.
>>
>>> Results with 2 groups of same io policy (BE), different io weights and
>>> CFQ scheduler:
>>> 			io-scheduler v8		io-scheduler v9
>>> weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
>>> weigths = 1000, 250	29.2s	45.8s		29.2s	45.6s
>>>
>>> The result in term of fairness is close to what we can expect from the
>>> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
>>> the first request get 2/3 (4/5) of io time as long as it runs and thus
>>> finish in about 3/4 (5/8) of total time. 
>>>
>> Jerome, after 36.6 seconds, disk will be fully given to second group.
>> Hence these times might not reflect the accurate measure of who got how
>> much of disk time.
>>
>> Can you just capture the output of "io.disk_time" file in both the cgroups
>> at the time of completion of task in higher weight group. Alternatively,
>> you can just run this a script in a loop which prints the output of
>>  "cat io.disk_time | grep major:minor" every  2 seconds. That way we can
>> see how disk times are being distributed between groups.
>>
>>> Results  with 2 groups of different io policies, same io weight and
>>> CFQ scheduler:
>>> 			io-scheduler v8		io-scheduler v9
>>> policy = RT, BE		22.5s	45.3s		22.4s	45.0s
>>> policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s
>>>
>>> Here again, the result in term of fairness is very close from what we
>>> expect.
>> Same as above in this case too.
>>
>> These seem to be good test for fairness measurement in case of streaming 
>> readers. I think one more interesting test case will be do how are the 
>> random read latencies in case of multiple streaming readers present.
>>
>> So if we can launch 4-5 dd processes in one group and then issue some
>> random small queueries on postgresql in second group, I am keen to see
>> how quickly the query can be completed with and without io controller.
>> Would be interesting to see at results for all 4 io schedulers.
>>
>> Thanks
>> Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-10 20:56       ` Vivek Goyal
  (?)
  (?)
@ 2009-09-11 13:16       ` Jerome Marchand
  2009-09-11 14:30           ` Vivek Goyal
       [not found]         ` <4AAA4DA7.8010909-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 2 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 13:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>  
>>> Hi Vivek,
>>>
>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>> relevant) and with io-controller v8 and v9 patches.
>>> I set up two instances of the TPC-H database, each running in their
>>> own io-cgroup. I ran two clients to these databases and tested on each
>>> that simple request:
>>> $ select count(*) from LINEITEM;
>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>> 720MB). That request generates a steady stream of IOs.
>>>
>>> Time is measure by psql (\timing switched on). Each test is run twice
>>> or more if there is any significant difference between the first two
>>> runs. Before each run, the cache is flush:
>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>
>>>
>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>
>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>> 	first	second		first	second		first	second
>>> 	DB	DB		DB	DB		DB	DB
>>>
>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>
>>> As you can see, there is no significant difference for CFQ
>>> scheduler.
>> Thanks Jerome.  
>>
>>> There is big improvement for noop and deadline schedulers
>>> (why is that happening?).
>> I think because now related IO is in a single queue and it gets to run
>> for 100ms or so (like CFQ). So previously, IO from both the instances
>> will go into a single queue which should lead to more seeks as requests
>> from two groups will kind of get interleaved.
>>
>> With io controller, both groups have separate queues so requests from
>> both the data based instances will not get interleaved (This almost
>> becomes like CFQ where ther are separate queues for each io context
>> and for sequential reader, one io context gets to run nicely for certain
>> ms based on its priority).
>>
>>> The performance with anticipatory scheduler
>>> is a bit lower (~4%).
>>>
> 
> Hi Jerome, 
> 
> Can you also run the AS test with io controller patches and both the
> database in root group (basically don't put them in to separate group). I 
> suspect that this regression might come from that fact that we now have
> to switch between queues and in AS we wait for request to finish from
> previous queue before next queue is scheduled in and probably that is
> slowing down things a bit.., just a wild guess..
> 

Hi Vivek,

I guess that's not the reason. I got 46.6s for both DBs in the root group with
the io-controller v9 patches. I also reran the test with the DBs in different
groups and got about the same results as above (48.3s and 48.6s).

Jerome



> Thanks
> Vivek
> 
>> I will run some tests with AS and see if I can reproduce this lower
>> performance and attribute it to a particular piece of code.
>>
>>> Results with 2 groups of same io policy (BE), different io weights and
>>> CFQ scheduler:
>>> 			io-scheduler v8		io-scheduler v9
>>> weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
>>> weigths = 1000, 250	29.2s	45.8s		29.2s	45.6s
>>>
>>> The result in term of fairness is close to what we can expect from the
>>> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
>>> the first request get 2/3 (4/5) of io time as long as it runs and thus
>>> finish in about 3/4 (5/8) of total time. 
>>>
>> Jerome, after 36.6 seconds, disk will be fully given to second group.
>> Hence these times might not reflect the accurate measure of who got how
>> much of disk time.
>>
>> Can you just capture the output of "io.disk_time" file in both the cgroups
>> at the time of completion of task in higher weight group. Alternatively,
>> you can just run this a script in a loop which prints the output of
>>  "cat io.disk_time | grep major:minor" every  2 seconds. That way we can
>> see how disk times are being distributed between groups.
>>
>>> Results  with 2 groups of different io policies, same io weight and
>>> CFQ scheduler:
>>> 			io-scheduler v8		io-scheduler v9
>>> policy = RT, BE		22.5s	45.3s		22.4s	45.0s
>>> policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s
>>>
>>> Here again, the result in term of fairness is very close from what we
>>> expect.
>> Same as above in this case too.
>>
>> These seem to be good test for fairness measurement in case of streaming 
>> readers. I think one more interesting test case will be do how are the 
>> random read latencies in case of multiple streaming readers present.
>>
>> So if we can launch 4-5 dd processes in one group and then issue some
>> random small queueries on postgresql in second group, I am keen to see
>> how quickly the query can be completed with and without io controller.
>> Would be interesting to see at results for all 4 io schedulers.
>>
>> Thanks
>> Vivek


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]         ` <4AAA4DA7.8010909-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11 14:30           ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:30 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> Hi All,
> >>>>
> >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>  
> >>> Hi Vivek,
> >>>
> >>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>> relevant) and with io-controller v8 and v9 patches.
> >>> I set up two instances of the TPC-H database, each running in their
> >>> own io-cgroup. I ran two clients to these databases and tested on each
> >>> that simple request:
> >>> $ select count(*) from LINEITEM;
> >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>> 720MB). That request generates a steady stream of IOs.
> >>>
> >>> Time is measure by psql (\timing switched on). Each test is run twice
> >>> or more if there is any significant difference between the first two
> >>> runs. Before each run, the cache is flush:
> >>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>>
> >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>
> >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>> 	first	second		first	second		first	second
> >>> 	DB	DB		DB	DB		DB	DB
> >>>
> >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>
> >>> As you can see, there is no significant difference for CFQ
> >>> scheduler.
> >> Thanks Jerome.  
> >>
> >>> There is big improvement for noop and deadline schedulers
> >>> (why is that happening?).
> >> I think because now related IO is in a single queue and it gets to run
> >> for 100ms or so (like CFQ). So previously, IO from both the instances
> >> will go into a single queue which should lead to more seeks as requests
> >> from two groups will kind of get interleaved.
> >>
> >> With io controller, both groups have separate queues so requests from
> >> both the data based instances will not get interleaved (This almost
> >> becomes like CFQ where ther are separate queues for each io context
> >> and for sequential reader, one io context gets to run nicely for certain
> >> ms based on its priority).
> >>
> >>> The performance with anticipatory scheduler
> >>> is a bit lower (~4%).
> >>>
> > 
> > Hi Jerome, 
> > 
> > Can you also run the AS test with io controller patches and both the
> > database in root group (basically don't put them in to separate group). I 
> > suspect that this regression might come from that fact that we now have
> > to switch between queues and in AS we wait for request to finish from
> > previous queue before next queue is scheduled in and probably that is
> > slowing down things a bit.., just a wild guess..
> > 
> 
> Hi Vivek,
> 
> I guess that's not the reason. I got 46.6s for both DB in root group with
> io-controller v9 patches. I also rerun the test with DB in different groups
> and found about the same result as above (48.3s and 48.6s).
> 

Hi Jerome,

Ok, so when both the DBs are in the root group (with io-controller V9
patches), then you get 46.6 seconds for both DBs. That means there is no
regression in this case. Here there is only the root group's queue, and AS
runs its timed read/write batches on that single queue.

But when the two DBs are put in separate groups, you get 48.3 and 48.6
seconds respectively and we see the regression. In this case there are two
queues, one per group. The elevator layer takes care of switching between
the group queues, and AS runs its timed read/write batches on those queues.

If that is correct, it does not exclude the possibility that the regression
is queue-switching overhead between groups, does it?
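
If it helps, a blktrace capture of the AS run with the two groups should show how often the
dispatch switches between the two queues (sketch only; /dev/sdb stands in for whatever disk
the databases live on):

$ blktrace -d /dev/sdb -o as-groups &     # capture block layer events during the run
$ # ... run the two queries, one DB per group ...
$ kill %1
$ blkparse -i as-groups | less            # inspect how the dispatch alternates between the ioqs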
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-11 13:16       ` Jerome Marchand
@ 2009-09-11 14:30           ` Vivek Goyal
       [not found]         ` <4AAA4DA7.8010909-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:30 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> Hi All,
> >>>>
> >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>  
> >>> Hi Vivek,
> >>>
> >>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>> relevant) and with io-controller v8 and v9 patches.
> >>> I set up two instances of the TPC-H database, each running in their
> >>> own io-cgroup. I ran two clients to these databases and tested on each
> >>> that simple request:
> >>> $ select count(*) from LINEITEM;
> >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>> 720MB). That request generates a steady stream of IOs.
> >>>
> >>> Time is measure by psql (\timing switched on). Each test is run twice
> >>> or more if there is any significant difference between the first two
> >>> runs. Before each run, the cache is flush:
> >>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>>
> >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>
> >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>> 	first	second		first	second		first	second
> >>> 	DB	DB		DB	DB		DB	DB
> >>>
> >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>
> >>> As you can see, there is no significant difference for CFQ
> >>> scheduler.
> >> Thanks Jerome.  
> >>
> >>> There is big improvement for noop and deadline schedulers
> >>> (why is that happening?).
> >> I think because now related IO is in a single queue and it gets to run
> >> for 100ms or so (like CFQ). So previously, IO from both the instances
> >> will go into a single queue which should lead to more seeks as requests
> >> from two groups will kind of get interleaved.
> >>
> >> With io controller, both groups have separate queues so requests from
> >> both the data based instances will not get interleaved (This almost
> >> becomes like CFQ where ther are separate queues for each io context
> >> and for sequential reader, one io context gets to run nicely for certain
> >> ms based on its priority).
> >>
> >>> The performance with anticipatory scheduler
> >>> is a bit lower (~4%).
> >>>
> > 
> > Hi Jerome, 
> > 
> > Can you also run the AS test with io controller patches and both the
> > database in root group (basically don't put them in to separate group). I 
> > suspect that this regression might come from that fact that we now have
> > to switch between queues and in AS we wait for request to finish from
> > previous queue before next queue is scheduled in and probably that is
> > slowing down things a bit.., just a wild guess..
> > 
> 
> Hi Vivek,
> 
> I guess that's not the reason. I got 46.6s for both DB in root group with
> io-controller v9 patches. I also rerun the test with DB in different groups
> and found about the same result as above (48.3s and 48.6s).
> 

Hi Jerome,

Ok, so when both the DBs are in the root group (with io-controller V9
patches), then you get 46.6 seconds for both DBs. That means there is no
regression in this case. Here there is only the root group's queue, and AS
runs its timed read/write batches on that single queue.

But when the two DBs are put in separate groups, you get 48.3 and 48.6
seconds respectively and we see the regression. In this case there are two
queues, one per group. The elevator layer takes care of switching between
the group queues, and AS runs its timed read/write batches on those queues.

If that is correct, it does not exclude the possibility that the regression
is queue-switching overhead between groups, does it?
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-11 14:30           ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:30 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, guijianfeng, fernando, mikew, jmoyer, nauman,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> Hi All,
> >>>>
> >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>  
> >>> Hi Vivek,
> >>>
> >>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>> relevant) and with io-controller v8 and v9 patches.
> >>> I set up two instances of the TPC-H database, each running in their
> >>> own io-cgroup. I ran two clients to these databases and tested on each
> >>> that simple request:
> >>> $ select count(*) from LINEITEM;
> >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>> 720MB). That request generates a steady stream of IOs.
> >>>
> >>> Time is measure by psql (\timing switched on). Each test is run twice
> >>> or more if there is any significant difference between the first two
> >>> runs. Before each run, the cache is flush:
> >>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>>
> >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>
> >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>> 	first	second		first	second		first	second
> >>> 	DB	DB		DB	DB		DB	DB
> >>>
> >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>
> >>> As you can see, there is no significant difference for CFQ
> >>> scheduler.
> >> Thanks Jerome.  
> >>
> >>> There is big improvement for noop and deadline schedulers
> >>> (why is that happening?).
> >> I think because now related IO is in a single queue and it gets to run
> >> for 100ms or so (like CFQ). So previously, IO from both the instances
> >> will go into a single queue which should lead to more seeks as requests
> >> from two groups will kind of get interleaved.
> >>
> >> With io controller, both groups have separate queues so requests from
> >> both the data based instances will not get interleaved (This almost
> >> becomes like CFQ where ther are separate queues for each io context
> >> and for sequential reader, one io context gets to run nicely for certain
> >> ms based on its priority).
> >>
> >>> The performance with anticipatory scheduler
> >>> is a bit lower (~4%).
> >>>
> > 
> > Hi Jerome, 
> > 
> > Can you also run the AS test with io controller patches and both the
> > database in root group (basically don't put them in to separate group). I 
> > suspect that this regression might come from that fact that we now have
> > to switch between queues and in AS we wait for request to finish from
> > previous queue before next queue is scheduled in and probably that is
> > slowing down things a bit.., just a wild guess..
> > 
> 
> Hi Vivek,
> 
> I guess that's not the reason. I got 46.6s for both DB in root group with
> io-controller v9 patches. I also rerun the test with DB in different groups
> and found about the same result as above (48.3s and 48.6s).
> 

Hi Jerome,

Ok, so when both the DBs are in the root group (with io-controller V9
patches), then you get 46.6 seconds for both DBs. That means there is no
regression in this case. Here there is only the root group's queue, and AS
runs its timed read/write batches on that single queue.

But when the two DBs are put in separate groups, you get 48.3 and 48.6
seconds respectively and we see the regression. In this case there are two
queues, one per group. The elevator layer takes care of switching between
the group queues, and AS runs its timed read/write batches on those queues.

If that is correct, it does not exclude the possibility that the regression
is queue-switching overhead between groups, does it?
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]           ` <20090911143040.GB6758-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11 14:43             ` Vivek Goyal
  2009-09-11 14:44             ` Jerome Marchand
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:43 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> > >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > >>> Vivek Goyal wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > >>>  
> > >>> Hi Vivek,
> > >>>
> > >>> I've run some postgresql benchmarks for io-controller. Tests have been
> > >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> > >>> relevant) and with io-controller v8 and v9 patches.
> > >>> I set up two instances of the TPC-H database, each running in their
> > >>> own io-cgroup. I ran two clients to these databases and tested on each
> > >>> that simple request:
> > >>> $ select count(*) from LINEITEM;
> > >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > >>> 720MB). That request generates a steady stream of IOs.
> > >>>
> > >>> Time is measure by psql (\timing switched on). Each test is run twice
> > >>> or more if there is any significant difference between the first two
> > >>> runs. Before each run, the cache is flush:
> > >>> $ echo 3 > /proc/sys/vm/drop_caches
> > >>>
> > >>>
> > >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> > >>>
> > >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> > >>> 	first	second		first	second		first	second
> > >>> 	DB	DB		DB	DB		DB	DB
> > >>>
> > >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> > >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> > >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> > >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> > >>>
> > >>> As you can see, there is no significant difference for CFQ
> > >>> scheduler.
> > >> Thanks Jerome.  
> > >>
> > >>> There is big improvement for noop and deadline schedulers
> > >>> (why is that happening?).
> > >> I think because now related IO is in a single queue and it gets to run
> > >> for 100ms or so (like CFQ). So previously, IO from both the instances
> > >> will go into a single queue which should lead to more seeks as requests
> > >> from two groups will kind of get interleaved.
> > >>
> > >> With io controller, both groups have separate queues so requests from
> > >> both the data based instances will not get interleaved (This almost
> > >> becomes like CFQ where ther are separate queues for each io context
> > >> and for sequential reader, one io context gets to run nicely for certain
> > >> ms based on its priority).
> > >>
> > >>> The performance with anticipatory scheduler
> > >>> is a bit lower (~4%).
> > >>>
> > > 
> > > Hi Jerome, 
> > > 
> > > Can you also run the AS test with io controller patches and both the
> > > database in root group (basically don't put them in to separate group). I 
> > > suspect that this regression might come from that fact that we now have
> > > to switch between queues and in AS we wait for request to finish from
> > > previous queue before next queue is scheduled in and probably that is
> > > slowing down things a bit.., just a wild guess..
> > > 
> > 
> > Hi Vivek,
> > 
> > I guess that's not the reason. I got 46.6s for both DB in root group with
> > io-controller v9 patches. I also rerun the test with DB in different groups
> > and found about the same result as above (48.3s and 48.6s).
> > 
> 
> Hi Jerome,
> 
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of 
> root group and AS is running timed read/write batches on this queue.
> 
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
> 
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?
>  

Does your hard drive support command queuing? Maybe we are driving deeper
queue depths for reads, and during a queue switch (for AS) we wait for the
requests from the last queue to finish before the next queue is scheduled
in; that will probably cause more delay if we are driving a deeper queue
depth.

Can you please set the queue depth to "1" (/sys/block/<disk>/device/queue_depth)
on this disk and see whether the times consumed in the two cases are the same or
different? I think setting the depth to "1" will bring down overall throughput,
but if the times are the same in both cases, at least we will know where the
delay is coming from.
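
Something along these lines should do (sketch; substitute the real disk name and note the old
depth first so it can be restored):

$ cat /sys/block/<disk>/device/queue_depth       # remember the current depth
$ echo 1 > /sys/block/<disk>/device/queue_depth
$ echo 3 > /proc/sys/vm/drop_caches
$ # ... rerun the AS test, once with both DBs in the root group, once in separate groups ...
$ echo <old_depth> > /sys/block/<disk>/device/queue_depth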

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-11 14:30           ` Vivek Goyal
@ 2009-09-11 14:43             ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:43 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> > >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > >>> Vivek Goyal wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > >>>  
> > >>> Hi Vivek,
> > >>>
> > >>> I've run some postgresql benchmarks for io-controller. Tests have been
> > >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> > >>> relevant) and with io-controller v8 and v9 patches.
> > >>> I set up two instances of the TPC-H database, each running in their
> > >>> own io-cgroup. I ran two clients to these databases and tested on each
> > >>> that simple request:
> > >>> $ select count(*) from LINEITEM;
> > >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > >>> 720MB). That request generates a steady stream of IOs.
> > >>>
> > >>> Time is measure by psql (\timing switched on). Each test is run twice
> > >>> or more if there is any significant difference between the first two
> > >>> runs. Before each run, the cache is flush:
> > >>> $ echo 3 > /proc/sys/vm/drop_caches
> > >>>
> > >>>
> > >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> > >>>
> > >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> > >>> 	first	second		first	second		first	second
> > >>> 	DB	DB		DB	DB		DB	DB
> > >>>
> > >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> > >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> > >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> > >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> > >>>
> > >>> As you can see, there is no significant difference for CFQ
> > >>> scheduler.
> > >> Thanks Jerome.  
> > >>
> > >>> There is big improvement for noop and deadline schedulers
> > >>> (why is that happening?).
> > >> I think because now related IO is in a single queue and it gets to run
> > >> for 100ms or so (like CFQ). So previously, IO from both the instances
> > >> will go into a single queue which should lead to more seeks as requests
> > >> from two groups will kind of get interleaved.
> > >>
> > >> With io controller, both groups have separate queues so requests from
> > >> both the data based instances will not get interleaved (This almost
> > >> becomes like CFQ where ther are separate queues for each io context
> > >> and for sequential reader, one io context gets to run nicely for certain
> > >> ms based on its priority).
> > >>
> > >>> The performance with anticipatory scheduler
> > >>> is a bit lower (~4%).
> > >>>
> > > 
> > > Hi Jerome, 
> > > 
> > > Can you also run the AS test with io controller patches and both the
> > > database in root group (basically don't put them in to separate group). I 
> > > suspect that this regression might come from that fact that we now have
> > > to switch between queues and in AS we wait for request to finish from
> > > previous queue before next queue is scheduled in and probably that is
> > > slowing down things a bit.., just a wild guess..
> > > 
> > 
> > Hi Vivek,
> > 
> > I guess that's not the reason. I got 46.6s for both DB in root group with
> > io-controller v9 patches. I also rerun the test with DB in different groups
> > and found about the same result as above (48.3s and 48.6s).
> > 
> 
> Hi Jerome,
> 
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of 
> root group and AS is running timed read/write batches on this queue.
> 
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
> 
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?
>  

Does your hard drive support command queuing? Maybe we are driving deeper
queue depths for reads, and during a queue switch (for AS) we wait for the
requests from the last queue to finish before the next queue is scheduled
in; that will probably cause more delay if we are driving a deeper queue
depth.

Can you please set the queue depth to "1" (/sys/block/<disk>/device/queue_depth)
on this disk and see whether the times consumed in the two cases are the same or
different? I think setting the depth to "1" will bring down overall throughput,
but if the times are the same in both cases, at least we will know where the
delay is coming from.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-11 14:43             ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 14:43 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, guijianfeng, fernando, mikew, jmoyer, nauman,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> > >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > >>> Vivek Goyal wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > >>>  
> > >>> Hi Vivek,
> > >>>
> > >>> I've run some postgresql benchmarks for io-controller. Tests have been
> > >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> > >>> relevant) and with io-controller v8 and v9 patches.
> > >>> I set up two instances of the TPC-H database, each running in their
> > >>> own io-cgroup. I ran two clients to these databases and tested on each
> > >>> that simple request:
> > >>> $ select count(*) from LINEITEM;
> > >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > >>> 720MB). That request generates a steady stream of IOs.
> > >>>
> > >>> Time is measure by psql (\timing switched on). Each test is run twice
> > >>> or more if there is any significant difference between the first two
> > >>> runs. Before each run, the cache is flush:
> > >>> $ echo 3 > /proc/sys/vm/drop_caches
> > >>>
> > >>>
> > >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> > >>>
> > >>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> > >>> 	first	second		first	second		first	second
> > >>> 	DB	DB		DB	DB		DB	DB
> > >>>
> > >>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> > >>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> > >>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> > >>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> > >>>
> > >>> As you can see, there is no significant difference for CFQ
> > >>> scheduler.
> > >> Thanks Jerome.  
> > >>
> > >>> There is big improvement for noop and deadline schedulers
> > >>> (why is that happening?).
> > >> I think because now related IO is in a single queue and it gets to run
> > >> for 100ms or so (like CFQ). So previously, IO from both the instances
> > >> will go into a single queue which should lead to more seeks as requests
> > >> from two groups will kind of get interleaved.
> > >>
> > >> With io controller, both groups have separate queues so requests from
> > >> both the data based instances will not get interleaved (This almost
> > >> becomes like CFQ where ther are separate queues for each io context
> > >> and for sequential reader, one io context gets to run nicely for certain
> > >> ms based on its priority).
> > >>
> > >>> The performance with anticipatory scheduler
> > >>> is a bit lower (~4%).
> > >>>
> > > 
> > > Hi Jerome, 
> > > 
> > > Can you also run the AS test with io controller patches and both the
> > > database in root group (basically don't put them in to separate group). I 
> > > suspect that this regression might come from that fact that we now have
> > > to switch between queues and in AS we wait for request to finish from
> > > previous queue before next queue is scheduled in and probably that is
> > > slowing down things a bit.., just a wild guess..
> > > 
> > 
> > Hi Vivek,
> > 
> > I guess that's not the reason. I got 46.6s for both DB in root group with
> > io-controller v9 patches. I also rerun the test with DB in different groups
> > and found about the same result as above (48.3s and 48.6s).
> > 
> 
> Hi Jerome,
> 
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of 
> root group and AS is running timed read/write batches on this queue.
> 
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
> 
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?
>  

Does your hard drive support command queuing? Maybe we are driving deeper
queue depths for reads, and during a queue switch (for AS) we wait for the
requests from the last queue to finish before the next queue is scheduled
in; that will probably cause more delay if we are driving a deeper queue
depth.

Can you please set the queue depth to "1" (/sys/block/<disk>/device/queue_depth)
on this disk and see whether the times consumed in the two cases are the same or
different? I think setting the depth to "1" will bring down overall throughput,
but if the times are the same in both cases, at least we will know where the
delay is coming from.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]           ` <20090911143040.GB6758-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-11 14:43             ` Vivek Goyal
@ 2009-09-11 14:44             ` Jerome Marchand
  1 sibling, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 14:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>> Vivek Goyal wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>  
>>>>> Hi Vivek,
>>>>>
>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>> I set up two instances of the TPC-H database, each running in their
>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>> that simple request:
>>>>> $ select count(*) from LINEITEM;
>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>
>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>> or more if there is any significant difference between the first two
>>>>> runs. Before each run, the cache is flush:
>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>
>>>>>
>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>
>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>>>> 	first	second		first	second		first	second
>>>>> 	DB	DB		DB	DB		DB	DB
>>>>>
>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>>>
>>>>> As you can see, there is no significant difference for CFQ
>>>>> scheduler.
>>>> Thanks Jerome.  
>>>>
>>>>> There is big improvement for noop and deadline schedulers
>>>>> (why is that happening?).
>>>> I think because now related IO is in a single queue and it gets to run
>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>> will go into a single queue which should lead to more seeks as requests
>>>> from two groups will kind of get interleaved.
>>>>
>>>> With io controller, both groups have separate queues so requests from
>>>> both the data based instances will not get interleaved (This almost
>>>> becomes like CFQ where ther are separate queues for each io context
>>>> and for sequential reader, one io context gets to run nicely for certain
>>>> ms based on its priority).
>>>>
>>>>> The performance with anticipatory scheduler
>>>>> is a bit lower (~4%).
>>>>>
>>> Hi Jerome, 
>>>
>>> Can you also run the AS test with io controller patches and both the
>>> database in root group (basically don't put them in to separate group). I 
>>> suspect that this regression might come from that fact that we now have
>>> to switch between queues and in AS we wait for request to finish from
>>> previous queue before next queue is scheduled in and probably that is
>>> slowing down things a bit.., just a wild guess..
>>>
>> Hi Vivek,
>>
>> I guess that's not the reason. I got 46.6s for both DB in root group with
>> io-controller v9 patches. I also rerun the test with DB in different groups
>> and found about the same result as above (48.3s and 48.6s).
>>
> 
> Hi Jerome,
> 
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of 
> root group and AS is running timed read/write batches on this queue.
> 
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
> 
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?

Yes it's correct. I misunderstood you.

Jerome

>  
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-11 14:30           ` Vivek Goyal
                             ` (2 preceding siblings ...)
  (?)
@ 2009-09-11 14:44           ` Jerome Marchand
  -1 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 14:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>> Vivek Goyal wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>  
>>>>> Hi Vivek,
>>>>>
>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>> I set up two instances of the TPC-H database, each running in their
>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>> that simple request:
>>>>> $ select count(*) from LINEITEM;
>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>
>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>> or more if there is any significant difference between the first two
>>>>> runs. Before each run, the cache is flush:
>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>
>>>>>
>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>
>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>>>> 	first	second		first	second		first	second
>>>>> 	DB	DB		DB	DB		DB	DB
>>>>>
>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>>>
>>>>> As you can see, there is no significant difference for CFQ
>>>>> scheduler.
>>>> Thanks Jerome.  
>>>>
>>>>> There is big improvement for noop and deadline schedulers
>>>>> (why is that happening?).
>>>> I think because now related IO is in a single queue and it gets to run
>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>> will go into a single queue which should lead to more seeks as requests
>>>> from two groups will kind of get interleaved.
>>>>
>>>> With io controller, both groups have separate queues so requests from
>>>> both the data based instances will not get interleaved (This almost
>>>> becomes like CFQ where ther are separate queues for each io context
>>>> and for sequential reader, one io context gets to run nicely for certain
>>>> ms based on its priority).
>>>>
>>>>> The performance with anticipatory scheduler
>>>>> is a bit lower (~4%).
>>>>>
>>> Hi Jerome, 
>>>
>>> Can you also run the AS test with io controller patches and both the
>>> database in root group (basically don't put them in to separate group). I 
>>> suspect that this regression might come from that fact that we now have
>>> to switch between queues and in AS we wait for request to finish from
>>> previous queue before next queue is scheduled in and probably that is
>>> slowing down things a bit.., just a wild guess..
>>>
>> Hi Vivek,
>>
>> I guess that's not the reason. I got 46.6s for both DB in root group with
>> io-controller v9 patches. I also rerun the test with DB in different groups
>> and found about the same result as above (48.3s and 48.6s).
>>
> 
> Hi Jerome,
> 
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of 
> root group and AS is running timed read/write batches on this queue.
> 
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
> 
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?

Yes it's correct. I misunderstood you.

Jerome

>  
> Thanks
> Vivek


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-11 14:43             ` Vivek Goyal
@ 2009-09-11 14:55                 ` Jerome Marchand
  -1 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 14:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
>> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>>> Vivek Goyal wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>>  
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>>> I set up two instances of the TPC-H database, each running in their
>>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>>> that simple request:
>>>>>> $ select count(*) from LINEITEM;
>>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>>
>>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>>> or more if there is any significant difference between the first two
>>>>>> runs. Before each run, the cache is flush:
>>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>>
>>>>>>
>>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>>
>>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>>>>> 	first	second		first	second		first	second
>>>>>> 	DB	DB		DB	DB		DB	DB
>>>>>>
>>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>>>>
>>>>>> As you can see, there is no significant difference for CFQ
>>>>>> scheduler.
>>>>> Thanks Jerome.  
>>>>>
>>>>>> There is big improvement for noop and deadline schedulers
>>>>>> (why is that happening?).
>>>>> I think because now related IO is in a single queue and it gets to run
>>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>>> will go into a single queue which should lead to more seeks as requests
>>>>> from two groups will kind of get interleaved.
>>>>>
>>>>> With io controller, both groups have separate queues so requests from
>>>>> both the data based instances will not get interleaved (This almost
>>>>> becomes like CFQ where ther are separate queues for each io context
>>>>> and for sequential reader, one io context gets to run nicely for certain
>>>>> ms based on its priority).
>>>>>
>>>>>> The performance with anticipatory scheduler
>>>>>> is a bit lower (~4%).
>>>>>>
>>>> Hi Jerome, 
>>>>
>>>> Can you also run the AS test with io controller patches and both the
>>>> database in root group (basically don't put them in to separate group). I 
>>>> suspect that this regression might come from that fact that we now have
>>>> to switch between queues and in AS we wait for request to finish from
>>>> previous queue before next queue is scheduled in and probably that is
>>>> slowing down things a bit.., just a wild guess..
>>>>
>>> Hi Vivek,
>>>
>>> I guess that's not the reason. I got 46.6s for both DB in root group with
>>> io-controller v9 patches. I also rerun the test with DB in different groups
>>> and found about the same result as above (48.3s and 48.6s).
>>>
>> Hi Jerome,
>>
>> Ok, so when both the DB's are in root group (with io-controller V9
>> patches), then you get 46.6 seconds time for both the DBs. That means there
>> is no regression in this case. In this case there is only one queue of 
>> root group and AS is running timed read/write batches on this queue.
>>
>> But when both the DBs are put in separate groups then you get 48.3 and
>> 48.6 seconds respectively and we see regression. In this case there are
>> two queues belonging to each group. Elevator layer takes care of queue
>> group queue switch and AS runs timed read/write batches on these queues.
>>
>> If it is correct, then it does not exclude the possiblity that it is queue
>> switching overhead between groups?
>>  
> 
> Does your hard drive support command queuing? May be we are driving deeper
> queue depths for reads and during queue switch we will wait for requests
> to finish from last queue to finish before next queue is scheduled in (for
> AS) and that probably will cause more delay if we are driving deeper queue
> depth.
> 
> Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> this disk and see time consumed in two cases are same or different. I think
> setting depth to "1" will bring down overall throughput but if times are same
> in two cases, at least we will know where the delay is coming from.
> 
> Thanks
> Vivek

It looks like command queuing is supported but disabled. Queue depth is already 1
and the file /sys/block/<disk>/device/queue_depth is read-only.

Jerome

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-11 14:55                 ` Jerome Marchand
  0 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-11 14:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
>> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>>> Vivek Goyal wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>>  
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>>> I set up two instances of the TPC-H database, each running in their
>>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>>> that simple request:
>>>>>> $ select count(*) from LINEITEM;
>>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>>
>>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>>> or more if there is any significant difference between the first two
>>>>>> runs. Before each run, the cache is flush:
>>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>>
>>>>>>
>>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>>
>>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>>>>>> 	first	second		first	second		first	second
>>>>>> 	DB	DB		DB	DB		DB	DB
>>>>>>
>>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>>>>>
>>>>>> As you can see, there is no significant difference for CFQ
>>>>>> scheduler.
>>>>> Thanks Jerome.  
>>>>>
>>>>>> There is big improvement for noop and deadline schedulers
>>>>>> (why is that happening?).
>>>>> I think because now related IO is in a single queue and it gets to run
>>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>>> will go into a single queue which should lead to more seeks as requests
>>>>> from two groups will kind of get interleaved.
>>>>>
>>>>> With io controller, both groups have separate queues so requests from
>>>>> both the data based instances will not get interleaved (This almost
>>>>> becomes like CFQ where ther are separate queues for each io context
>>>>> and for sequential reader, one io context gets to run nicely for certain
>>>>> ms based on its priority).
>>>>>
>>>>>> The performance with anticipatory scheduler
>>>>>> is a bit lower (~4%).
>>>>>>
>>>> Hi Jerome, 
>>>>
>>>> Can you also run the AS test with io controller patches and both the
>>>> database in root group (basically don't put them in to separate group). I 
>>>> suspect that this regression might come from that fact that we now have
>>>> to switch between queues and in AS we wait for request to finish from
>>>> previous queue before next queue is scheduled in and probably that is
>>>> slowing down things a bit.., just a wild guess..
>>>>
>>> Hi Vivek,
>>>
>>> I guess that's not the reason. I got 46.6s for both DB in root group with
>>> io-controller v9 patches. I also rerun the test with DB in different groups
>>> and found about the same result as above (48.3s and 48.6s).
>>>
>> Hi Jerome,
>>
>> Ok, so when both the DB's are in root group (with io-controller V9
>> patches), then you get 46.6 seconds time for both the DBs. That means there
>> is no regression in this case. In this case there is only one queue of 
>> root group and AS is running timed read/write batches on this queue.
>>
>> But when both the DBs are put in separate groups then you get 48.3 and
>> 48.6 seconds respectively and we see regression. In this case there are
>> two queues belonging to each group. Elevator layer takes care of queue
>> group queue switch and AS runs timed read/write batches on these queues.
>>
>> If it is correct, then it does not exclude the possiblity that it is queue
>> switching overhead between groups?
>>  
> 
> Does your hard drive support command queuing? May be we are driving deeper
> queue depths for reads and during queue switch we will wait for requests
> to finish from last queue to finish before next queue is scheduled in (for
> AS) and that probably will cause more delay if we are driving deeper queue
> depth.
> 
> Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> this disk and see time consumed in two cases are same or different. I think
> setting depth to "1" will bring down overall throughput but if times are same
> in two cases, at least we will know where the delay is coming from.
> 
> Thanks
> Vivek

It looks like command queuing is supported but disabled. Queue depth is already 1
and the file /sys/block/<disk>/device/queue_depth is read-only.

Jerome

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]                 ` <4AAA64F6.2050800-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11 15:01                   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 15:01 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 11, 2009 at 04:55:50PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> >> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>>>>> Vivek Goyal wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>>>>  
> >>>>>> Hi Vivek,
> >>>>>>
> >>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>>>>> relevant) and with io-controller v8 and v9 patches.
> >>>>>> I set up two instances of the TPC-H database, each running in their
> >>>>>> own io-cgroup. I ran two clients to these databases and tested on each
> >>>>>> that simple request:
> >>>>>> $ select count(*) from LINEITEM;
> >>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>>>>> 720MB). That request generates a steady stream of IOs.
> >>>>>>
> >>>>>> Time is measure by psql (\timing switched on). Each test is run twice
> >>>>>> or more if there is any significant difference between the first two
> >>>>>> runs. Before each run, the cache is flush:
> >>>>>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>>>>
> >>>>>>
> >>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>>>>
> >>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>>>>> 	first	second		first	second		first	second
> >>>>>> 	DB	DB		DB	DB		DB	DB
> >>>>>>
> >>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>>>>
> >>>>>> As you can see, there is no significant difference for CFQ
> >>>>>> scheduler.
> >>>>> Thanks Jerome.  
> >>>>>
> >>>>>> There is big improvement for noop and deadline schedulers
> >>>>>> (why is that happening?).
> >>>>> I think because now related IO is in a single queue and it gets to run
> >>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
> >>>>> will go into a single queue which should lead to more seeks as requests
> >>>>> from two groups will kind of get interleaved.
> >>>>>
> >>>>> With io controller, both groups have separate queues so requests from
> >>>>> both the data based instances will not get interleaved (This almost
> >>>>> becomes like CFQ where ther are separate queues for each io context
> >>>>> and for sequential reader, one io context gets to run nicely for certain
> >>>>> ms based on its priority).
> >>>>>
> >>>>>> The performance with anticipatory scheduler
> >>>>>> is a bit lower (~4%).
> >>>>>>
> >>>> Hi Jerome, 
> >>>>
> >>>> Can you also run the AS test with io controller patches and both the
> >>>> database in root group (basically don't put them in to separate group). I 
> >>>> suspect that this regression might come from that fact that we now have
> >>>> to switch between queues and in AS we wait for request to finish from
> >>>> previous queue before next queue is scheduled in and probably that is
> >>>> slowing down things a bit.., just a wild guess..
> >>>>
> >>> Hi Vivek,
> >>>
> >>> I guess that's not the reason. I got 46.6s for both DB in root group with
> >>> io-controller v9 patches. I also rerun the test with DB in different groups
> >>> and found about the same result as above (48.3s and 48.6s).
> >>>
> >> Hi Jerome,
> >>
> >> Ok, so when both the DB's are in root group (with io-controller V9
> >> patches), then you get 46.6 seconds time for both the DBs. That means there
> >> is no regression in this case. In this case there is only one queue of 
> >> root group and AS is running timed read/write batches on this queue.
> >>
> >> But when both the DBs are put in separate groups then you get 48.3 and
> >> 48.6 seconds respectively and we see regression. In this case there are
> >> two queues belonging to each group. Elevator layer takes care of queue
> >> group queue switch and AS runs timed read/write batches on these queues.
> >>
> >> If it is correct, then it does not exclude the possiblity that it is queue
> >> switching overhead between groups?
> >>  
> > 
> > Does your hard drive support command queuing? May be we are driving deeper
> > queue depths for reads and during queue switch we will wait for requests
> > to finish from last queue to finish before next queue is scheduled in (for
> > AS) and that probably will cause more delay if we are driving deeper queue
> > depth.
> > 
> > Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> > this disk and see time consumed in two cases are same or different. I think
> > setting depth to "1" will bring down overall throughput but if times are same
> > in two cases, at least we will know where the delay is coming from.
> > 
> > Thanks
> > Vivek
> 
> It looks like command queuing is supported but disabled. Queue depth is already 1
> and the file /sys/block/<disk>/device/queue_depth is read-only.

Hmm..., time to run blktrace in both cases, compare the two traces and see
what the issue is.

It would be great if you could capture and look at the traces. Otherwise I
will try to do it sometime soon.
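
Something like the following should be enough to capture a trace of each
run (a rough sketch; "sdb", the output names and the 60 second window are
placeholders, adjust them to the actual device and test length):

$ blktrace -d /dev/sdb -o as-groups -w 60   # run alongside the two-group test
$ blkparse -i as-groups > as-groups.txt     # turn the binary trace into text

Repeat with both DBs in the root group (say -o as-root) and then compare
the queue switches and seek patterns between the two parsed traces.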

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-11 14:55                 ` Jerome Marchand
@ 2009-09-11 15:01                   ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 15:01 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

On Fri, Sep 11, 2009 at 04:55:50PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> >> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>>>>> Vivek Goyal wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>>>>  
> >>>>>> Hi Vivek,
> >>>>>>
> >>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>>>>> relevant) and with io-controller v8 and v9 patches.
> >>>>>> I set up two instances of the TPC-H database, each running in their
> >>>>>> own io-cgroup. I ran two clients to these databases and tested on each
> >>>>>> that simple request:
> >>>>>> $ select count(*) from LINEITEM;
> >>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>>>>> 720MB). That request generates a steady stream of IOs.
> >>>>>>
> >>>>>> Time is measure by psql (\timing switched on). Each test is run twice
> >>>>>> or more if there is any significant difference between the first two
> >>>>>> runs. Before each run, the cache is flush:
> >>>>>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>>>>
> >>>>>>
> >>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>>>>
> >>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>>>>> 	first	second		first	second		first	second
> >>>>>> 	DB	DB		DB	DB		DB	DB
> >>>>>>
> >>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>>>>
> >>>>>> As you can see, there is no significant difference for CFQ
> >>>>>> scheduler.
> >>>>> Thanks Jerome.  
> >>>>>
> >>>>>> There is big improvement for noop and deadline schedulers
> >>>>>> (why is that happening?).
> >>>>> I think because now related IO is in a single queue and it gets to run
> >>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
> >>>>> will go into a single queue which should lead to more seeks as requests
> >>>>> from two groups will kind of get interleaved.
> >>>>>
> >>>>> With io controller, both groups have separate queues so requests from
> >>>>> both the data based instances will not get interleaved (This almost
> >>>>> becomes like CFQ where ther are separate queues for each io context
> >>>>> and for sequential reader, one io context gets to run nicely for certain
> >>>>> ms based on its priority).
> >>>>>
> >>>>>> The performance with anticipatory scheduler
> >>>>>> is a bit lower (~4%).
> >>>>>>
> >>>> Hi Jerome, 
> >>>>
> >>>> Can you also run the AS test with io controller patches and both the
> >>>> database in root group (basically don't put them in to separate group). I 
> >>>> suspect that this regression might come from that fact that we now have
> >>>> to switch between queues and in AS we wait for request to finish from
> >>>> previous queue before next queue is scheduled in and probably that is
> >>>> slowing down things a bit.., just a wild guess..
> >>>>
> >>> Hi Vivek,
> >>>
> >>> I guess that's not the reason. I got 46.6s for both DB in root group with
> >>> io-controller v9 patches. I also rerun the test with DB in different groups
> >>> and found about the same result as above (48.3s and 48.6s).
> >>>
> >> Hi Jerome,
> >>
> >> Ok, so when both the DB's are in root group (with io-controller V9
> >> patches), then you get 46.6 seconds time for both the DBs. That means there
> >> is no regression in this case. In this case there is only one queue of 
> >> root group and AS is running timed read/write batches on this queue.
> >>
> >> But when both the DBs are put in separate groups then you get 48.3 and
> >> 48.6 seconds respectively and we see regression. In this case there are
> >> two queues belonging to each group. Elevator layer takes care of queue
> >> group queue switch and AS runs timed read/write batches on these queues.
> >>
> >> If it is correct, then it does not exclude the possiblity that it is queue
> >> switching overhead between groups?
> >>  
> > 
> > Does your hard drive support command queuing? May be we are driving deeper
> > queue depths for reads and during queue switch we will wait for requests
> > to finish from last queue to finish before next queue is scheduled in (for
> > AS) and that probably will cause more delay if we are driving deeper queue
> > depth.
> > 
> > Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> > this disk and see time consumed in two cases are same or different. I think
> > setting depth to "1" will bring down overall throughput but if times are same
> > in two cases, at least we will know where the delay is coming from.
> > 
> > Thanks
> > Vivek
> 
> It looks like command queuing is supported but disabled. Queue depth is already 1
> and the file /sys/block/<disk>/device/queue_depth is read-only.

Hmm..., time to run blktrace in both cases, compare the two traces and see
what the issue is.

It would be great if you could capture and look at the traces. Otherwise I
will try to do it sometime soon.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-11 15:01                   ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-11 15:01 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, guijianfeng, fernando, mikew, jmoyer, nauman,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Sep 11, 2009 at 04:55:50PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> >> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>>>>> Vivek Goyal wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>>>>  
> >>>>>> Hi Vivek,
> >>>>>>
> >>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>>>>> relevant) and with io-controller v8 and v9 patches.
> >>>>>> I set up two instances of the TPC-H database, each running in their
> >>>>>> own io-cgroup. I ran two clients to these databases and tested on each
> >>>>>> that simple request:
> >>>>>> $ select count(*) from LINEITEM;
> >>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>>>>> 720MB). That request generates a steady stream of IOs.
> >>>>>>
> >>>>>> Time is measure by psql (\timing switched on). Each test is run twice
> >>>>>> or more if there is any significant difference between the first two
> >>>>>> runs. Before each run, the cache is flush:
> >>>>>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>>>>
> >>>>>>
> >>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>>>>
> >>>>>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> >>>>>> 	first	second		first	second		first	second
> >>>>>> 	DB	DB		DB	DB		DB	DB
> >>>>>>
> >>>>>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> >>>>>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> >>>>>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> >>>>>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> >>>>>>
> >>>>>> As you can see, there is no significant difference for CFQ
> >>>>>> scheduler.
> >>>>> Thanks Jerome.  
> >>>>>
> >>>>>> There is big improvement for noop and deadline schedulers
> >>>>>> (why is that happening?).
> >>>>> I think because now related IO is in a single queue and it gets to run
> >>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
> >>>>> will go into a single queue which should lead to more seeks as requests
> >>>>> from two groups will kind of get interleaved.
> >>>>>
> >>>>> With io controller, both groups have separate queues so requests from
> >>>>> both the data based instances will not get interleaved (This almost
> >>>>> becomes like CFQ where ther are separate queues for each io context
> >>>>> and for sequential reader, one io context gets to run nicely for certain
> >>>>> ms based on its priority).
> >>>>>
> >>>>>> The performance with anticipatory scheduler
> >>>>>> is a bit lower (~4%).
> >>>>>>
> >>>> Hi Jerome, 
> >>>>
> >>>> Can you also run the AS test with io controller patches and both the
> >>>> database in root group (basically don't put them in to separate group). I 
> >>>> suspect that this regression might come from that fact that we now have
> >>>> to switch between queues and in AS we wait for request to finish from
> >>>> previous queue before next queue is scheduled in and probably that is
> >>>> slowing down things a bit.., just a wild guess..
> >>>>
> >>> Hi Vivek,
> >>>
> >>> I guess that's not the reason. I got 46.6s for both DB in root group with
> >>> io-controller v9 patches. I also rerun the test with DB in different groups
> >>> and found about the same result as above (48.3s and 48.6s).
> >>>
> >> Hi Jerome,
> >>
> >> Ok, so when both the DB's are in root group (with io-controller V9
> >> patches), then you get 46.6 seconds time for both the DBs. That means there
> >> is no regression in this case. In this case there is only one queue of 
> >> root group and AS is running timed read/write batches on this queue.
> >>
> >> But when both the DBs are put in separate groups then you get 48.3 and
> >> 48.6 seconds respectively and we see regression. In this case there are
> >> two queues belonging to each group. Elevator layer takes care of queue
> >> group queue switch and AS runs timed read/write batches on these queues.
> >>
> >> If it is correct, then it does not exclude the possiblity that it is queue
> >> switching overhead between groups?
> >>  
> > 
> > Does your hard drive support command queuing? May be we are driving deeper
> > queue depths for reads and during queue switch we will wait for requests
> > to finish from last queue to finish before next queue is scheduled in (for
> > AS) and that probably will cause more delay if we are driving deeper queue
> > depth.
> > 
> > Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> > this disk and see time consumed in two cases are same or different. I think
> > setting depth to "1" will bring down overall throughput but if times are same
> > in two cases, at least we will know where the delay is coming from.
> > 
> > Thanks
> > Vivek
> 
> It looks like command queuing is supported but disabled. Queue depth is already 1
> and the file /sys/block/<disk>/device/queue_depth is read-only.

Hmm..., time to run blktrace in both cases, compare the two traces and see
what the issue is.

It would be great if you could capture and look at the traces. Otherwise I
will try to do it sometime soon.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
       [not found]   ` <4AA918C1.6070907-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-10 20:52     ` Vivek Goyal
@ 2009-09-13 18:54     ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-13 18:54 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>  
> Hi Vivek,
> 
> I've run some postgresql benchmarks for io-controller. Tests have been
> made with 2.6.31-rc6 kernel, without io-controller patches (when
> relevant) and with io-controller v8 and v9 patches.
> I set up two instances of the TPC-H database, each running in their
> own io-cgroup. I ran two clients to these databases and tested on each
> that simple request:
> $ select count(*) from LINEITEM;
> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> 720MB). That request generates a steady stream of IOs.
> 
> Time is measure by psql (\timing switched on). Each test is run twice
> or more if there is any significant difference between the first two
> runs. Before each run, the cache is flush:
> $ echo 3 > /proc/sys/vm/drop_caches
> 
> 
> Results with 2 groups of same io policy (BE) and same io weight (1000):
> 
> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> 	first	second		first	second		first	second
> 	DB	DB		DB	DB		DB	DB
> 
> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> 
> As you can see, there is no significant difference for CFQ
> scheduler. There is big improvement for noop and deadline schedulers
> (why is that happening?). The performance with anticipatory scheduler
> is a bit lower (~4%).
> 

Ok, I think what's happening here is that by default the slice length for
a queue is 100ms. When you put the two instances of the DB in two different
groups, one streaming reader can run for at most 100ms at a stretch and then
we switch to the next reader.

But when both readers are in the root group, AS lets one reader run for up
to 250ms (sometimes 125ms and sometimes 250ms, depending on when
as_fifo_expired() was invoked).

So because a reader gets to run longer at one stretch in the root group, it
does fewer seeks, which leads to slightly higher throughput.

If you change /sys/block/<disk>/queue/iosched/slice_sync to 250ms, then one
group queue can run for up to 250ms before we switch queues. In that case
you should be able to get the same performance as in the root group.
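
For example (a sketch; "sdb" is a placeholder and the value is taken in
milliseconds, matching the 100ms/250ms figures above):

$ cat /sys/block/sdb/queue/iosched/slice_sync      # should show the 100 default
$ echo 250 > /sys/block/sdb/queue/iosched/slice_sync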

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
  2009-09-10 15:18 ` [RFC] IO scheduler based IO controller V9 Jerome Marchand
@ 2009-09-13 18:54     ` Vivek Goyal
  2009-09-13 18:54     ` Vivek Goyal
       [not found]   ` <4AA918C1.6070907-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-13 18:54 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>  
> Hi Vivek,
> 
> I've run some postgresql benchmarks for io-controller. Tests have been
> made with 2.6.31-rc6 kernel, without io-controller patches (when
> relevant) and with io-controller v8 and v9 patches.
> I set up two instances of the TPC-H database, each running in their
> own io-cgroup. I ran two clients to these databases and tested on each
> that simple request:
> $ select count(*) from LINEITEM;
> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> 720MB). That request generates a steady stream of IOs.
> 
> Time is measure by psql (\timing switched on). Each test is run twice
> or more if there is any significant difference between the first two
> runs. Before each run, the cache is flush:
> $ echo 3 > /proc/sys/vm/drop_caches
> 
> 
> Results with 2 groups of same io policy (BE) and same io weight (1000):
> 
> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> 	first	second		first	second		first	second
> 	DB	DB		DB	DB		DB	DB
> 
> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> 
> As you can see, there is no significant difference for CFQ
> scheduler. There is big improvement for noop and deadline schedulers
> (why is that happening?). The performance with anticipatory scheduler
> is a bit lower (~4%).
> 

Ok, I think what's happening here is that by default the slice length for
a queue is 100ms. When you put the two instances of the DB in two different
groups, one streaming reader can run for at most 100ms at a stretch and then
we switch to the next reader.

But when both readers are in the root group, AS lets one reader run for up
to 250ms (sometimes 125ms and sometimes 250ms, depending on when
as_fifo_expired() was invoked).

So because a reader gets to run longer at one stretch in the root group, it
does fewer seeks, which leads to slightly higher throughput.

If you change /sys/block/<disk>/queue/iosched/slice_sync to 250ms, then one
group queue can run for up to 250ms before we switch queues. In that case
you should be able to get the same performance as in the root group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-13 18:54     ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-13 18:54 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, guijianfeng, fernando, mikew, jmoyer, nauman,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>  
> Hi Vivek,
> 
> I've run some postgresql benchmarks for io-controller. Tests have been
> made with 2.6.31-rc6 kernel, without io-controller patches (when
> relevant) and with io-controller v8 and v9 patches.
> I set up two instances of the TPC-H database, each running in their
> own io-cgroup. I ran two clients to these databases and tested on each
> that simple request:
> $ select count(*) from LINEITEM;
> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> 720MB). That request generates a steady stream of IOs.
> 
> Time is measure by psql (\timing switched on). Each test is run twice
> or more if there is any significant difference between the first two
> runs. Before each run, the cache is flush:
> $ echo 3 > /proc/sys/vm/drop_caches
> 
> 
> Results with 2 groups of same io policy (BE) and same io weight (1000):
> 
> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
> 	first	second		first	second		first	second
> 	DB	DB		DB	DB		DB	DB
> 
> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
> 
> As you can see, there is no significant difference for CFQ
> scheduler. There is big improvement for noop and deadline schedulers
> (why is that happening?). The performance with anticipatory scheduler
> is a bit lower (~4%).
> 

Ok, I think what's happening here is that by default the slice length for
a queue is 100ms. When you put the two instances of the DB in two different
groups, one streaming reader can run for at most 100ms at a stretch and then
we switch to the next reader.

But when both readers are in the root group, AS lets one reader run for up
to 250ms (sometimes 125ms and sometimes 250ms, depending on when
as_fifo_expired() was invoked).

So because a reader gets to run longer at one stretch in the root group, it
does fewer seeks, which leads to slightly higher throughput.

If you change /sys/block/<disk>/queue/iosched/slice_sync to 250ms, then one
group queue can run for up to 250ms before we switch queues. In that case
you should be able to get the same performance as in the root group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]           ` <4AA9A4BE.30005-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-09-14  2:44             ` Vivek Goyal
  2009-09-15  3:37             ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-14  2:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:

[..]
> Hi Vivek, Jens,
> 
> Currently, If there's only the root cgroup and no other child cgroup available, io-controller will
> optimize to stop expiring the current ioq, and we thought the current ioq belongs to root group. But
> in some cases, this assumption is not true. Consider the following scenario, if there is a child cgroup
> located in root cgroup, and task A is running in the child cgroup, and task A issues some IOs. Then we
> kill task A and remove the child cgroup, at this time, there is only root cgroup available. But the ioq
> is still under service, and from now on, this ioq won't expire because "only root" optimization.
> The following patch ensures the ioq do belongs to the root group if there's only root group existing.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

Cool. Good catch Gui. Queued for next posting.

Thanks
Vivek

> ---
>  block/elevator-fq.c |   13 +++++++------
>  1 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index b723c12..3f86552 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
>  	}
>  }
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
> -	if (list_empty(&io_root_cgroup.css.cgroup->children))
> +	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
> +	    efqd->busy_queues == 1 && efqd->root_group->ioq)
>  		return 1;
>  
>  	return 0;
> @@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
>  int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
>  EXPORT_SYMBOL(elv_iog_should_idle);
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
>  	return 1;
>  }
> @@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
>  	struct elevator_queue *e = q->elevator;
>  	struct io_queue *ioq = elv_active_ioq(q->elevator);
>  	int ret = 1;
> -
> +	
>  	if (e->ops->elevator_expire_ioq_fn) {
>  		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
>  							slice_expired, force);
> @@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  	 * single queue ioschedulers (noop, deadline, AS).
>  	 */
>  
> -	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
> +	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
>  		goto keep_queue;
>  
>  	/* We are waiting for this group to become busy before it expires.*/
> @@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>  		 * unnecessary overhead.
>  		 */
>  
> -		if (is_only_root_group() &&
> +		if (is_only_root_group(ioq->efqd) &&
>  			elv_iosched_single_ioq(q->elevator)) {
>  			elv_log_ioq(efqd, ioq, "select: only root group,"
>  					" no expiry");
> -- 
> 1.5.4.rc3
> 
> 
> 
> 
> 
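For reference, the scenario described above can be reproduced roughly like
this (a sketch only; the /cgroup mount point, the "io" subsystem name and
the device are assumptions and may differ on your setup):

$ mount -t cgroup -o io none /cgroup
$ mkdir /cgroup/child
$ echo $$ > /cgroup/child/tasks       # move this shell into the child group
$ dd if=/dev/sdb of=/dev/null bs=1M & # task A: stream of reads issued from the child group
$ kill %1; echo $$ > /cgroup/tasks    # kill task A, move the shell back to the root group
$ rmdir /cgroup/child                 # only the root group is left, but without the fix
                                      # the ioq that was in service never expires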

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-11  1:15         ` Gui Jianfeng
@ 2009-09-14  2:44             ` Vivek Goyal
       [not found]           ` <4AA9A4BE.30005-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-09-15  3:37             ` Vivek Goyal
  2 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-14  2:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: jens.axboe, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:

[..]
> Hi Vivek, Jens,
> 
> Currently, If there's only the root cgroup and no other child cgroup available, io-controller will
> optimize to stop expiring the current ioq, and we thought the current ioq belongs to root group. But
> in some cases, this assumption is not true. Consider the following scenario, if there is a child cgroup
> located in root cgroup, and task A is running in the child cgroup, and task A issues some IOs. Then we
> kill task A and remove the child cgroup, at this time, there is only root cgroup available. But the ioq
> is still under service, and from now on, this ioq won't expire because "only root" optimization.
> The following patch ensures the ioq do belongs to the root group if there's only root group existing.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>

Cool. Good catch Gui. Queued for next posting.

Thanks
Vivek

> ---
>  block/elevator-fq.c |   13 +++++++------
>  1 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index b723c12..3f86552 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
>  	}
>  }
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
> -	if (list_empty(&io_root_cgroup.css.cgroup->children))
> +	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
> +	    efqd->busy_queues == 1 && efqd->root_group->ioq)
>  		return 1;
>  
>  	return 0;
> @@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
>  int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
>  EXPORT_SYMBOL(elv_iog_should_idle);
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
>  	return 1;
>  }
> @@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
>  	struct elevator_queue *e = q->elevator;
>  	struct io_queue *ioq = elv_active_ioq(q->elevator);
>  	int ret = 1;
> -
> +	
>  	if (e->ops->elevator_expire_ioq_fn) {
>  		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
>  							slice_expired, force);
> @@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  	 * single queue ioschedulers (noop, deadline, AS).
>  	 */
>  
> -	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
> +	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
>  		goto keep_queue;
>  
>  	/* We are waiting for this group to become busy before it expires.*/
> @@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>  		 * unnecessary overhead.
>  		 */
>  
> -		if (is_only_root_group() &&
> +		if (is_only_root_group(ioq->efqd) &&
>  			elv_iosched_single_ioq(q->elevator)) {
>  			elv_log_ioq(efqd, ioq, "select: only root group,"
>  					" no expiry");
> -- 
> 1.5.4.rc3
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
@ 2009-09-14  2:44             ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-14  2:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:

[..]
> Hi Vivek, Jens,
> 
> Currently, If there's only the root cgroup and no other child cgroup available, io-controller will
> optimize to stop expiring the current ioq, and we thought the current ioq belongs to root group. But
> in some cases, this assumption is not true. Consider the following scenario, if there is a child cgroup
> located in root cgroup, and task A is running in the child cgroup, and task A issues some IOs. Then we
> kill task A and remove the child cgroup, at this time, there is only root cgroup available. But the ioq
> is still under service, and from now on, this ioq won't expire because "only root" optimization.
> The following patch ensures the ioq do belongs to the root group if there's only root group existing.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>

Cool. Good catch Gui. Queued for next posting.

Thanks
Vivek
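
(For reference, a rough sketch of the scenario Gui describes; the /cgroup
mount point, the group name "test" and the device are assumptions here:)

	#!/bin/sh
	# A child group issues IO, then its task is killed and the group removed;
	# only the root group remains, but the child's ioq may still be the one
	# under service, which the old "only root group" check did not catch.
	mkdir /cgroup/test
	echo $$ > /cgroup/test/tasks		# run "task A" from the child group
	dd if=/dev/sdb of=/dev/null bs=1M &	# task A's IO stream
	DD=$!
	sleep 5
	kill $DD				# kill task A
	wait $DD 2>/dev/null			# reap it so the group can be removed
	echo $$ > /cgroup/tasks			# move back to the root group
	rmdir /cgroup/test			# remove the child group; only root is left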

> ---
>  block/elevator-fq.c |   13 +++++++------
>  1 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index b723c12..3f86552 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -2338,9 +2338,10 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
>  	}
>  }
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
> -	if (list_empty(&io_root_cgroup.css.cgroup->children))
> +	if (list_empty(&io_root_cgroup.css.cgroup->children) &&
> +	    efqd->busy_queues == 1 && efqd->root_group->ioq)
>  		return 1;
>  
>  	return 0;
> @@ -2383,7 +2384,7 @@ static void io_free_root_group(struct elevator_queue *e)
>  int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
>  EXPORT_SYMBOL(elv_iog_should_idle);
>  
> -static inline int is_only_root_group(void)
> +static inline int is_only_root_group(struct elv_fq_data *efqd)
>  {
>  	return 1;
>  }
> @@ -2547,7 +2548,7 @@ elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
>  	struct elevator_queue *e = q->elevator;
>  	struct io_queue *ioq = elv_active_ioq(q->elevator);
>  	int ret = 1;
> -
> +	
>  	if (e->ops->elevator_expire_ioq_fn) {
>  		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
>  							slice_expired, force);
> @@ -2969,7 +2970,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  	 * single queue ioschedulers (noop, deadline, AS).
>  	 */
>  
> -	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
> +	if (is_only_root_group(efqd) && elv_iosched_single_ioq(q->elevator))
>  		goto keep_queue;
>  
>  	/* We are waiting for this group to become busy before it expires.*/
> @@ -3180,7 +3181,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>  		 * unnecessary overhead.
>  		 */
>  
> -		if (is_only_root_group() &&
> +		if (is_only_root_group(ioq->efqd) &&
>  			elv_iosched_single_ioq(q->elevator)) {
>  			elv_log_ioq(efqd, ioq, "select: only root group,"
>  					" no expiry");
> -- 
> 1.5.4.rc3
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-14 14:26         ` Jerome Marchand
  0 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-14 14:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>  
>> Hi Vivek,
>>
>> I've run some postgresql benchmarks for io-controller. Tests have been
>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>> relevant) and with io-controller v8 and v9 patches.
>> I set up two instances of the TPC-H database, each running in their
>> own io-cgroup. I ran two clients to these databases and tested on each
>> that simple request:
>> $ select count(*) from LINEITEM;
>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>> 720MB). That request generates a steady stream of IOs.
>>
>> Time is measured by psql (\timing switched on). Each test is run twice
>> or more if there is any significant difference between the first two
>> runs. Before each run, the cache is flushed:
>> $ echo 3 > /proc/sys/vm/drop_caches
>>
>>
>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>
>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>> 	first	second		first	second		first	second
>> 	DB	DB		DB	DB		DB	DB
>>
>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>
>> As you can see, there is no significant difference for CFQ
>> scheduler.
> 
> Thanks Jerome.  
> 
>> There is a big improvement for the noop and deadline schedulers
>> (why is that happening?).
> 
> I think it is because related IO is now in a single queue and it gets to run
> for 100ms or so (like CFQ). Previously, IO from both instances would go
> into a single queue, which should lead to more seeks as requests from the
> two groups get interleaved.
> 
> With the io controller, both groups have separate queues, so requests from
> the two database instances do not get interleaved (this almost becomes
> like CFQ, where there are separate queues for each io context and a
> sequential reader's io context gets to run nicely for a certain number of
> ms based on its priority).
> 
>> The performance with anticipatory scheduler
>> is a bit lower (~4%).
>>
> 
> I will run some tests with AS and see if I can reproduce this lower
> performance and attribute it to a particular piece of code.
> 
>> Results with 2 groups of same io policy (BE), different io weights and
>> CFQ scheduler:
>> 			io-scheduler v8		io-scheduler v9
>> weights = 1000, 500	35.6s	46.7s		35.6s	46.7s
>> weights = 1000, 250	29.2s	45.8s		29.2s	45.6s
>>
>> The result in terms of fairness is close to what we can expect from the
>> ideal theoretical case: with io weights of 1000 and 500 (1000 and 250),
>> the first request gets 2/3 (4/5) of the io time as long as it runs and thus
>> finishes in about 3/4 (5/8) of the total time.
>>
> 
> Jerome, after 36.6 seconds, disk will be fully given to second group.
> Hence these times might not reflect the accurate measure of who got how
> much of disk time.

I know and took it into account. Let me detail my calculations.

Both requests are of the same size and each takes a time T to complete when
running alone (about 22.5s in our example). For the sake of simplicity, let's
ignore switching costs. Then the completion time of both requests running at
the same time would be 2T, whatever their weights or classes.
If one group weighs 1000 and the other 500 (resp. 250), the first group
gets 2/3 (4/5) of the bandwidth as long as it is running, and thus finishes
in T/(2/3) = 2T*3/4 (resp. T/(4/5) = 2T*5/8), that is, 3/4 (5/8) of the
total time. The other always finishes in about 2T.

The actual results above are pretty close to these theoretical values, which
is how I concluded that the controller is pretty fair.
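
(As a quick sanity check of the arithmetic above against the measured times,
taking T ~= 22.5s:

	weights 1000, 500: 2T*3/4 ~= 33.8s vs. 35.6s measured; 2T ~= 45.0s vs. 46.7s
	weights 1000, 250: 2T*5/8 ~= 28.1s vs. 29.2s measured; 2T ~= 45.0s vs. 45.8s

The small remaining gap is plausibly the switching cost ignored above.)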

> 
> Can you just capture the output of the "io.disk_time" file in both the cgroups
> at the time of completion of the task in the higher weight group. Alternatively,
> you can just run a script in a loop which prints the output of
> "cat io.disk_time | grep major:minor" every 2 seconds. That way we can
> see how disk times are being distributed between groups.
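
(A minimal sketch of such a monitoring loop; the /cgroup mount point, the
group names and the "major:minor time" line format of io.disk_time are
assumptions here, and Jerome's actual watch_cgroup.sh was not posted:)

	#!/bin/sh
	# Print the io.disk_time consumed by each group since the last sample.
	interval=${1:-2}
	prev1=0; prev2=0
	while true; do
		cur1=$(awk '{s += $2} END {print s+0}' /cgroup/test1/io.disk_time)
		cur2=$(awk '{s += $2} END {print s+0}' /cgroup/test2/io.disk_time)
		echo "test1: $((cur1 - prev1))	test2: $((cur2 - prev2))"
		prev1=$cur1; prev2=$cur2
		sleep "$interval"
	done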

Actually, I had already checked that and the result was good, but I didn't keep
the output, so I just reran the (1000,500) weights. The first column is the
io time spent by the first group since the last refresh (refresh period is 2s).
The second column is the same for the second group. The group test3 is not
used. The first "ratios" column is the ratio between the io time spent by the
first group and that spent by the second group.

$ ./watch_cgroup.sh 2
test1: 0        test2: 0        test3: 0        ratios: --      --      --
test1: 805      test2: 301      test3: 0        ratios: 2.67441860465116279069  --      --
test1: 1209     test2: 714      test3: 0        ratios: 1.69327731092436974789  --      --
test1: 1306     test2: 503      test3: 0        ratios: 2.59642147117296222664  --      --
test1: 1210     test2: 604      test3: 0        ratios: 2.00331125827814569536  --      --
test1: 1207     test2: 605      test3: 0        ratios: 1.99504132231404958677  --      --
test1: 1209     test2: 605      test3: 0        ratios: 1.99834710743801652892  --      --
test1: 1206     test2: 606      test3: 0        ratios: 1.99009900990099009900  --      --
test1: 1109     test2: 607      test3: 0        ratios: 1.82701812191103789126  --      --
test1: 1213     test2: 603      test3: 0        ratios: 2.01160862354892205638  --      --
test1: 1214     test2: 608      test3: 0        ratios: 1.99671052631578947368  --      --
test1: 1211     test2: 603      test3: 0        ratios: 2.00829187396351575456  --      --
test1: 1110     test2: 603      test3: 0        ratios: 1.84079601990049751243  --      --
test1: 1210     test2: 605      test3: 0        ratios: 2.00000000000000000000  --      --
test1: 1211     test2: 601      test3: 0        ratios: 2.01497504159733777038  --      --
test1: 1210     test2: 607      test3: 0        ratios: 1.99341021416803953871  --      --
test1: 1204     test2: 604      test3: 0        ratios: 1.99337748344370860927  --      --
test1: 1207     test2: 605      test3: 0        ratios: 1.99504132231404958677  --      --
test1: 1089     test2: 708      test3: 0        ratios: 1.53813559322033898305  --      --
test1: 0        test2: 2124     test3: 0        ratios: 0       --      --
test1: 0        test2: 1915     test3: 0        ratios: 0       --      --
test1: 0        test2: 1919     test3: 0        ratios: 0       --      --
test1: 0        test2: 2023     test3: 0        ratios: 0       --      --
test1: 0        test2: 1925     test3: 0        ratios: 0       --      --
test1: 0        test2: 705      test3: 0        ratios: 0       --      --
test1: 0        test2: 0        test3: 0        ratios: --      --      --

As you can see, the ratio stays close to 2 as long as the first request is
running.

Regards,
Jerome

> 
>> Results  with 2 groups of different io policies, same io weight and
>> CFQ scheduler:
>> 			io-scheduler v8		io-scheduler v9
>> policy = RT, BE		22.5s	45.3s		22.4s	45.0s
>> policy = BE, IDLE	22.6s	44.8s		22.4s	45.0s
>>
>> Here again, the result in terms of fairness is very close to what we
>> expect.
> 
> Same as above in this case too.
> 
> These seem to be good tests for fairness measurement in the case of streaming
> readers. I think one more interesting test case would be to see what the
> random read latencies look like when multiple streaming readers are present.
> 
> So if we can launch 4-5 dd processes in one group and then issue some
> random small queries on postgresql in the second group, I am keen to see
> how quickly the query can be completed with and without the io controller.
> It would be interesting to see the results for all 4 io schedulers.
> 
> Thanks
> Vivek
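
(A possible setup for this test; the group names, the device, the database
name and the query below are placeholders:)

	#!/bin/sh
	# Streaming readers in group test1, a small random query timed in test2.
	for i in 1 2 3 4; do
		dd if=/dev/sdb of=/dev/null bs=1M &
		echo $! > /cgroup/test1/tasks
	done
	echo $$ > /cgroup/test2/tasks
	time psql tpch -c "select * from LINEITEM where L_ORDERKEY = 12345;"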


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [RFC] IO scheduler based IO controller V9
@ 2009-09-14 14:31         ` Jerome Marchand
  0 siblings, 0 replies; 322+ messages in thread
From: Jerome Marchand @ 2009-09-14 14:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, torvalds, mingo, riel

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>  
>> Hi Vivek,
>>
>> I've run some postgresql benchmarks for io-controller. Tests have been
>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>> relevant) and with io-controller v8 and v9 patches.
>> I set up two instances of the TPC-H database, each running in their
>> own io-cgroup. I ran two clients to these databases and tested on each
>> that simple request:
>> $ select count(*) from LINEITEM;
>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>> 720MB). That request generates a steady stream of IOs.
>>
>> Time is measured by psql (\timing switched on). Each test is run twice
>> or more if there is any significant difference between the first two
>> runs. Before each run, the cache is flushed:
>> $ echo 3 > /proc/sys/vm/drop_caches
>>
>>
>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>
>> 	w/o io-scheduler	io-scheduler v8		io-scheduler v9
>> 	first	second		first	second		first	second
>> 	DB	DB		DB	DB		DB	DB
>>
>> CFQ	48.4s	48.4s		48.2s	48.2s		48.1s	48.5s
>> Noop	138.0s	138.0s		48.3s	48.4s		48.5s	48.8s
>> AS	46.3s	47.0s		48.5s	48.7s		48.3s	48.5s
>> Deadl.	137.1s	137.1s		48.2s	48.3s		48.3s	48.5s
>>
>> As you can see, there is no significant difference for CFQ
>> scheduler. There is a big improvement for the noop and deadline schedulers
>> (why is that happening?). The performance with anticipatory scheduler
>> is a bit lower (~4%).
>>
> 
> Ok, I think what's happening here is that by default the slice length for
> a queue is 100ms. When you put the two DB instances in two different
> groups, one streaming reader can run for at most 100ms at a go and then
> we switch to the next reader.
> 
> But when both readers are in the root group, AS lets one reader run for
> up to 250ms (sometimes 125ms and sometimes 250ms, depending on when
> as_fifo_expired() was invoked).
> 
> So because a reader gets to run longer at a stretch in the root group, it
> reduces the number of seeks and leads to slightly higher throughput.
> 
> If you change /sys/block/<disk>/queue/iosched/slice_sync to 250ms, then
> one group's queue can run for up to 250ms before we switch queues. In
> that case you should be able to get the same performance as in the root group.
> 
> Thanks
> Vivek
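
(Concretely, something like the following, with sdb as a placeholder disk name:)

	# echo 250 > /sys/block/sdb/queue/iosched/slice_sync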

Indeed. When I run the benchmark with slice_sync = 250ms, I get results
close to the ones obtained with both instances running within the root group:
first group 46.1s and second group 46.4s.

Jerome



^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
       [not found]   ` <1251495072-7780-21-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-08-31 17:54     ` Rik van Riel
@ 2009-09-14 18:33     ` Nauman Rafique
  1 sibling, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-09-14 18:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Aug 28, 2009 at 2:31 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> o Currently a request queue has a fixed number of request descriptors for
>  sync and async requests. Once the request descriptors are consumed, new
>  processes are put to sleep and they effectively become serialized. Because
>  sync and async queues are separate, async requests don't impact sync ones,
>  but if one is looking for fairness between async requests, that is not
>  achievable if request queue descriptors become the bottleneck.
>
> o Make request descriptors per io group so that if there is lots of IO
>  going on in one cgroup, it does not impact the IO of other groups.
>
> o This patch implements the per cgroup request descriptors. The request pool
>  per queue is still common, but every group will have its own wait list and
>  its own count of request descriptors allocated to that group for sync and
>  async queues. So effectively the request_list becomes a per io group
>  property and not a global request queue feature.
>
> o Currently one can define q->nr_requests to limit request descriptors
>  allocated for the queue. Now there is another tunable, q->nr_group_requests,
>  which controls the request descriptor limit per group. q->nr_requests
>  supersedes q->nr_group_requests to make sure that if there are lots of
>  groups present, we don't end up allocating too many request descriptors
>  on the queue.
>
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
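
(As an aside, with this patch the per-group limit joins nr_requests as a
runtime sysfs tunable; the values and disk name below are only examples:)

	# echo 512 > /sys/block/sdb/queue/nr_requests
	# echo 128 > /sys/block/sdb/queue/nr_group_requests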
> ---
>  block/blk-core.c             |  317 +++++++++++++++++++++++++++++++++---------
>  block/blk-settings.c         |    1 +
>  block/blk-sysfs.c            |   59 ++++++--
>  block/elevator-fq.c          |   36 +++++
>  block/elevator-fq.h          |   29 ++++
>  block/elevator.c             |    7 +-
>  include/linux/blkdev.h       |   47 ++++++-
>  include/trace/events/block.h |    6 +-
>  kernel/trace/blktrace.c      |    6 +-
>  9 files changed, 421 insertions(+), 87 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 47cce59..18b400b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(blk_cleanup_queue);
>
> -static int blk_init_free_list(struct request_queue *q)
> +struct request_list *
> +blk_get_request_list(struct request_queue *q, struct bio *bio)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       /*
> +        * Determine which request list bio will be allocated from. This
> +        * is dependent on which io group bio belongs to
> +        */
> +       return elv_get_request_list_bio(q, bio);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       int priv = rq->cmd_flags & REQ_ELVPRIV;
> +
> +       return elv_get_request_list_rq(q, rq, priv);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +void blk_init_request_list(struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
>
>        rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
> -       rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
> -       rl->elvpriv = 0;
>        init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
>        init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
> +}
>
> -       rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
> -                               mempool_free_slab, request_cachep, q->node);
> +static int blk_init_free_list(struct request_queue *q)
> +{
> +       /*
> +        * In case of group scheduling, request list is inside group and is
> +        * initialized when group is instanciated.
> +        */
> +#ifndef CONFIG_GROUP_IOSCHED
> +       blk_init_request_list(&q->rq);
> +#endif
> +       q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
> +                               mempool_alloc_slab, mempool_free_slab,
> +                               request_cachep, q->node);
>
> -       if (!rl->rq_pool)
> +       if (!q->rq_data.rq_pool)
>                return -ENOMEM;
>
>        return 0;
> @@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
>        q->queue_flags          = QUEUE_FLAG_DEFAULT;
>        q->queue_lock           = lock;
>
> +       /* init starved waiter wait queue */
> +       init_waitqueue_head(&q->rq_data.starved_wait);
> +
>        /*
>         * This also sets hw/phys segments, boundary and size
>         */
> @@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
>  {
>        if (rq->cmd_flags & REQ_ELVPRIV)
>                elv_put_request(q, rq);
> -       mempool_free(rq, q->rq.rq_pool);
> +       mempool_free(rq, q->rq_data.rq_pool);
>  }
>
>  static struct request *
>  blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>                                        gfp_t gfp_mask)
>  {
> -       struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> +       struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
>
>        if (!rq)
>                return NULL;
> @@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>
>        if (priv) {
>                if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
> -                       mempool_free(rq, q->rq.rq_pool);
> +                       mempool_free(rq, q->rq_data.rq_pool);
>                        return NULL;
>                }
>                rq->cmd_flags |= REQ_ELVPRIV;
> @@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>        ioc->last_waited = jiffies;
>  }
>
> -static void __freed_request(struct request_queue *q, int sync)
> +static void __freed_request(struct request_queue *q, int sync,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> -       if (rl->count[sync] < queue_congestion_off_threshold(q))
> +       if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, sync);
>
> -       if (rl->count[sync] + 1 <= q->nr_requests) {
> +       if (q->rq_data.count[sync] + 1 <= q->nr_requests)
> +               blk_clear_queue_full(q, sync);
> +
> +       if (rl->count[sync] + 1 <= q->nr_group_requests) {
>                if (waitqueue_active(&rl->wait[sync]))
>                        wake_up(&rl->wait[sync]);
> -
> -               blk_clear_queue_full(q, sync);
>        }
>  }
>
> @@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int sync, int priv)
> +static void freed_request(struct request_queue *q, int sync, int priv,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> +       /*
> +        * There is a window during request allocation where request is
> +        * mapped to one group but by the time a queue for the group is
> +        * allocated, it is possible that original cgroup/io group has been
> +        * deleted and now io queue is allocated in a different group (root)
> +        * altogether.
> +        *
> +        * One solution to the problem is that rq should take io group
> +        * reference. But it looks too much to do that to solve this issue.
> +        * The only side affect to the hard to hit issue seems to be that
> +        * we will try to decrement the rl->count for a request list which
> +        * did not allocate that request. Chcek for rl->count going less than
> +        * zero and do not decrement it if that's the case.
> +        */
> +
> +       if (priv && rl->count[sync] > 0)
> +               rl->count[sync]--;
> +
> +       BUG_ON(!q->rq_data.count[sync]);
> +       q->rq_data.count[sync]--;
>
> -       rl->count[sync]--;
>        if (priv)
> -               rl->elvpriv--;
> +               q->rq_data.elvpriv--;
>
> -       __freed_request(q, sync);
> +       __freed_request(q, sync, rl);
>
>        if (unlikely(rl->starved[sync ^ 1]))
> -               __freed_request(q, sync ^ 1);
> +               __freed_request(q, sync ^ 1, rl);
> +
> +       /* Wake up the starved process on global list, if any */
> +       if (unlikely(q->rq_data.starved)) {
> +               if (waitqueue_active(&q->rq_data.starved_wait))
> +                       wake_up(&q->rq_data.starved_wait);
> +               q->rq_data.starved--;
> +       }
> +}
> +
> +/*
> + * Returns whether one can sleep on this request list or not. There are
> + * cases (elevator switch) where request list might not have allocated
> + * any request descriptor but we deny request allocation due to gloabl
> + * limits. In that case one should sleep on global list as on this request
> + * list no wakeup will take place.
> + *
> + * Also sets the request list starved flag if there are no requests pending
> + * in the direction of rq.
> + *
> + * Return 1 --> sleep on request list, 0 --> sleep on global list
> + */
> +static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
> +{
> +       if (unlikely(rl->count[is_sync] == 0)) {
> +               /*
> +                * If there is a request pending in other direction
> +                * in same io group, then set the starved flag of
> +                * the group request list. Otherwise, we need to
> +                * make this process sleep in global starved list
> +                * to make sure it will not sleep indefinitely.
> +                */
> +               if (rl->count[is_sync ^ 1] != 0) {
> +                       rl->starved[is_sync] = 1;
> +                       return 1;
> +               } else
> +                       return 0;
> +       }
> +
> +       return 1;
>  }
>
>  /*
>  * Get a free request, queue_lock must be held.
> - * Returns NULL on failure, with queue_lock held.
> + * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
> + * in case of failure. This reason field helps caller decide to whether sleep
> + * on per group list or global per queue list.
> + * reason = 0 sleep on per group list
> + * reason = 1 sleep on global list
> + *
>  * Returns !NULL on success, with queue_lock *not held*.
>  */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
> -                                  struct bio *bio, gfp_t gfp_mask)
> +                                       struct bio *bio, gfp_t gfp_mask,
> +                                       struct request_list *rl, int *reason)
>  {
>        struct request *rq = NULL;
> -       struct request_list *rl = &q->rq;
>        struct io_context *ioc = NULL;
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        int may_queue, priv;
> +       int sleep_on_global = 0;
>
>        may_queue = elv_may_queue(q, rw_flags);
>        if (may_queue == ELV_MQUEUE_NO)
>                goto rq_starved;
>
> -       if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
> -               if (rl->count[is_sync]+1 >= q->nr_requests) {
> -                       ioc = current_io_context(GFP_ATOMIC, q->node);
> -                       /*
> -                        * The queue will fill after this allocation, so set
> -                        * it as full, and mark this process as "batching".
> -                        * This process will be allowed to complete a batch of
> -                        * requests, others will be blocked.
> -                        */
> -                       if (!blk_queue_full(q, is_sync)) {
> -                               ioc_set_batching(q, ioc);
> -                               blk_set_queue_full(q, is_sync);
> -                       } else {
> -                               if (may_queue != ELV_MQUEUE_MUST
> -                                               && !ioc_batching(q, ioc)) {
> -                                       /*
> -                                        * The queue is full and the allocating
> -                                        * process is not a "batcher", and not
> -                                        * exempted by the IO scheduler
> -                                        */
> -                                       goto out;
> -                               }
> +       if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
> +               blk_set_queue_congested(q, is_sync);
> +
> +       /* queue full seems redundant now */
> +       if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
> +               blk_set_queue_full(q, is_sync);
> +
> +       if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +               ioc = current_io_context(GFP_ATOMIC, q->node);
> +               /*
> +                * The queue request descriptor group will fill after this
> +                * allocation, so set it as full, and mark this process as
> +                * "batching". This process will be allowed to complete a
> +                * batch of requests, others will be blocked.
> +                */
> +               if (rl->count[is_sync] <= q->nr_group_requests)
> +                       ioc_set_batching(q, ioc);
> +               else {
> +                       if (may_queue != ELV_MQUEUE_MUST
> +                                       && !ioc_batching(q, ioc)) {
> +                               /*
> +                                * The queue is full and the allocating
> +                                * process is not a "batcher", and not
> +                                * exempted by the IO scheduler
> +                                */
> +                               goto out;
>                        }
>                }
> -               blk_set_queue_congested(q, is_sync);
>        }
>
>        /*
> @@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>         * limit of requests, otherwise we could have thousands of requests
>         * allocated with any setting of ->nr_requests
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +
> +       if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
> +               /*
> +                * Queue is too full for allocation. On which request queue
> +                * the task should sleep? Generally it should sleep on its
> +                * request list but if elevator switch is happening, in that
> +                * window, request descriptors are allocated from global
> +                * pool and are not accounted against any particular request
> +                * list as group is going away.
> +                *
> +                * So it might happen that request list does not have any
> +                * requests allocated at all and if process sleeps on per
> +                * group request list, it will not be woken up. In such case,
> +                * make it sleep on global starved list.
> +                */
> +               if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
> +                   || !can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
> +               goto out;
> +       }
> +
> +       /*
> +        * Allocation of request is allowed from queue perspective. Now check
> +        * from per group request list
> +        */
> +
> +       if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>                goto out;
>
> -       rl->count[is_sync]++;
>        rl->starved[is_sync] = 0;
>
> +       q->rq_data.count[is_sync]++;
> +
>        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -       if (priv)
> -               rl->elvpriv++;
> +       if (priv) {
> +               q->rq_data.elvpriv++;
> +               /*
> +                * Account the request to request list only if request is
> +                * going to elevator. During elevator switch, there will
> +                * be small window where group is going away and new group
> +                * will not be allocated till elevator switch is complete.
> +                * So till then instead of slowing down the application,
> +                * we will continue to allocate request from total common
> +                * pool instead of per group limit
> +                */
> +               rl->count[is_sync]++;
> +       }
>
>        if (blk_queue_io_stat(q))
>                rw_flags |= REQ_IO_STAT;
>        spin_unlock_irq(q->queue_lock);
>
>        rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
> +
>        if (unlikely(!rq)) {
>                /*
>                 * Allocation failed presumably due to memory. Undo anything
> @@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * wait queue, but this is pretty rare.
>                 */
>                spin_lock_irq(q->queue_lock);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>
>                /*
>                 * in the very unlikely event that allocation failed and no
> @@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * rq mempool into READ and WRITE
>                 */
>  rq_starved:
> -               if (unlikely(rl->count[is_sync] == 0))
> -                       rl->starved[is_sync] = 1;
> -
> +               if (!can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
>                goto out;
>        }
>
> @@ -810,6 +951,8 @@ rq_starved:
>
>        trace_block_getrq(q, bio, rw_flags & 1);
>  out:
> +       if (reason && sleep_on_global)
> +               *reason = 1;
>        return rq;
>  }
>
> @@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                                        struct bio *bio)
>  {
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
> +       int sleep_on_global = 0;
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, bio);
>
> -       rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +       rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
>        while (!rq) {
>                DEFINE_WAIT(wait);
>                struct io_context *ioc;
> -               struct request_list *rl = &q->rq;
>
> -               prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> -                               TASK_UNINTERRUPTIBLE);
> +               if (sleep_on_global) {
> +                       /*
> +                        * Task failed allocation and needs to wait and
> +                        * try again. There are no requests pending from
> +                        * the io group hence need to sleep on global
> +                        * wait queue. Most likely the allocation failed
> +                        * because of memory issues.
> +                        */
> +
> +                       q->rq_data.starved++;
> +                       prepare_to_wait_exclusive(&q->rq_data.starved_wait,
> +                                       &wait, TASK_UNINTERRUPTIBLE);
> +               } else {
> +                       /*
> +                        * We are about to sleep on a request list and we
> +                        * drop queue lock. After waking up, we will do
> +                        * finish_wait() on request list and in the mean
> +                        * time group might be gone. Take a reference to
> +                        * the group now.
> +                        */
> +                       prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +                       elv_get_rl_iog(rl);
> +               }
>
>                trace_block_sleeprq(q, bio, rw_flags & 1);
>
> @@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                ioc_set_batching(q, ioc);
>
>                spin_lock_irq(q->queue_lock);
> -               finish_wait(&rl->wait[is_sync], &wait);
>
> -               rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +               if (sleep_on_global) {
> +                       finish_wait(&q->rq_data.starved_wait, &wait);
> +                       sleep_on_global = 0;
> +               } else {
> +                       /*
> +                        * We had taken a reference to the rl/iog. Put that now
> +                        */
> +                       finish_wait(&rl->wait[is_sync], &wait);
> +                       elv_put_rl_iog(rl);
> +               }
> +
> +               /*
> +                * After the sleep check the rl again in case cgrop bio
> +                * belonged to is gone and it is mapped to root group now
> +                */
> +               rl = blk_get_request_list(q, bio);
> +               rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
> +                                       &sleep_on_global);
>        };
>
>        return rq;
> @@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>        struct request *rq;
> +       struct request_list *rl;
>
>        BUG_ON(rw != READ && rw != WRITE);
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        if (gfp_mask & __GFP_WAIT) {
>                rq = get_request_wait(q, rw, NULL);
>        } else {
> -               rq = get_request(q, rw, NULL, gfp_mask);
> +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
>                if (!rq)
>                        spin_unlock_irq(q->queue_lock);
>        }
> @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>        if (req->cmd_flags & REQ_ALLOCED) {
>                int is_sync = rq_is_sync(req) != 0;
>                int priv = req->cmd_flags & REQ_ELVPRIV;
> +               struct request_list *rl = rq_rl(q, req);
>
>                BUG_ON(!list_empty(&req->queuelist));
>                BUG_ON(!hlist_unhashed(&req->hash));
>
>                blk_free_request(q, req);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);

We have a potential use-after-free here. freed_request() should be called
before blk_free_request(), as blk_free_request() might result in the release
of the cgroup and hence of its request_list. Calling freed_request() after
blk_free_request() would then operate on freed memory.

>        }
>  }
>  EXPORT_SYMBOL_GPL(__blk_put_request);
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 476d870..c3102c7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>         * set defaults
>         */
>        q->nr_requests = BLKDEV_MAX_RQ;
> +       q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>
>        q->make_request_fn = mfn;
>        blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 418d636..f3db7f0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  {
> -       struct request_list *rl = &q->rq;
> +       struct request_list *rl;
>        unsigned long nr;
>        int ret = queue_var_store(&nr, page, count);
>        if (nr < BLKDEV_MIN_RQ)
>                nr = BLKDEV_MIN_RQ;
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        q->nr_requests = nr;
>        blk_queue_congestion_threshold(q);
>
> -       if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_SYNC);
> -       else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_SYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_SYNC);
>
> -       if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_ASYNC);
> -       else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_ASYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_ASYNC);
>
> -       if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_SYNC);
> -       } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_SYNC);
>                wake_up(&rl->wait[BLK_RW_SYNC]);
>        }
>
> -       if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_ASYNC);
> -       } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_ASYNC);
>                wake_up(&rl->wait[BLK_RW_ASYNC]);
>        }
>        spin_unlock_irq(q->queue_lock);
>        return ret;
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +                                       size_t count)
> +{
> +       unsigned long nr;
> +       int ret = queue_var_store(&nr, page, count);
> +
> +       if (nr < BLKDEV_MIN_RQ)
> +               nr = BLKDEV_MIN_RQ;
> +
> +       spin_lock_irq(q->queue_lock);
> +       q->nr_group_requests = nr;
> +       spin_unlock_irq(q->queue_lock);
> +       return ret;
> +}
> +#endif
>
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
>  {
> @@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
>        .store = queue_requests_store,
>  };
>
> +#ifdef CONFIG_GROUP_IOSCHED
> +static struct queue_sysfs_entry queue_group_requests_entry = {
> +       .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_group_requests_show,
> +       .store = queue_group_requests_store,
> +};
> +#endif
> +
>  static struct queue_sysfs_entry queue_ra_entry = {
>        .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>        .show = queue_ra_show,
> @@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> +#ifdef CONFIG_GROUP_IOSCHED
> +       &queue_group_requests_entry.attr,
> +#endif
>        &queue_ra_entry.attr,
>        &queue_max_hw_sectors_entry.attr,
>        &queue_max_sectors_entry.attr,
> @@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
>  {
>        struct request_queue *q =
>                container_of(kobj, struct request_queue, kobj);
> -       struct request_list *rl = &q->rq;
>
>        blk_sync_queue(q);
>
> -       if (rl->rq_pool)
> -               mempool_destroy(rl->rq_pool);
> +       if (q->rq_data.rq_pool)
> +               mempool_destroy(q->rq_data.rq_pool);
>
>        if (q->queue_tags)
>                __blk_queue_free_tags(q);
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9c8783c..39896c2 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -925,6 +925,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
>                            struct io_cgroup, css);
>  }
>
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               iog = q->elevator->efqd->root_group;
> +       else
> +               iog = elv_io_get_io_group_bio(q, bio, 1);
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               return &q->elevator->efqd->root_group->rl;
> +
> +       BUG_ON(priv && !rq->ioq);
> +
> +       if (priv)
> +               iog = ioq_to_io_group(rq->ioq);
> +       else
> +               iog = q->elevator->efqd->root_group;
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
>  /*
>  * Search the io_group for efqd into the hash table (by now only a list)
>  * of bgrp.  Must be called under rcu_read_lock().
> @@ -1281,6 +1314,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
>                elv_get_iog(iog);
>                io_group_path(iog);
>
> +               blk_init_request_list(&iog->rl);
> +
>                if (leaf == NULL) {
>                        leaf = iog;
>                        prev = leaf;
> @@ -1502,6 +1537,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
>        for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>                iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
>
> +       blk_init_request_list(&iog->rl);
>        spin_lock_irq(&iocg->lock);
>        rcu_assign_pointer(iog->key, key);
>        hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index 9fe52fa..989102e 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -128,6 +128,9 @@ struct io_group {
>
>        /* Single ioq per group, used for noop, deadline, anticipatory */
>        struct io_queue *ioq;
> +
> +       /* request list associated with the group */
> +       struct request_list rl;
>  };
>
>  struct io_cgroup {
> @@ -425,11 +428,31 @@ static inline void elv_get_iog(struct io_group *iog)
>        atomic_inc(&iog->ref);
>  }
>
> +static inline struct io_group *rl_iog(struct request_list *rl)
> +{
> +       return container_of(rl, struct io_group, rl);
> +}
> +
> +static inline void elv_get_rl_iog(struct request_list *rl)
> +{
> +       elv_get_iog(rl_iog(rl));
> +}
> +
> +static inline void elv_put_rl_iog(struct request_list *rl)
> +{
> +       elv_put_iog(rl_iog(rl));
> +}
> +
>  extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
>                                        struct bio *bio, gfp_t gfp_mask);
>  extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
>  extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>                                                struct bio *bio);
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
>
>  #else /* !GROUP_IOSCHED */
>
> @@ -469,6 +492,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* GROUP_IOSCHED */
>
>  extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
> @@ -578,6 +604,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* CONFIG_ELV_FAIR_QUEUING */
>  #endif /* _ELV_SCHED_H */
>  #endif /* CONFIG_BLOCK */
> diff --git a/block/elevator.c b/block/elevator.c
> index 4ed37b6..b23db03 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
>         * make sure we don't have any requests in flight
>         */
>        elv_drain_elevator(q);
> -       while (q->rq.elvpriv) {
> +       while (q->rq_data.elvpriv) {
>                __blk_run_queue(q);
>                spin_unlock_irq(q->queue_lock);
>                msleep(10);
> @@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
>        }
>
>        if (unplug_it && blk_queue_plugged(q)) {
> -               int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
> -                               - queue_in_flight(q);
> +               int nrq = q->rq_data.count[BLK_RW_SYNC] +
> +                               q->rq_data.count[BLK_RW_ASYNC] -
> +                               queue_in_flight(q);
>
>                if (nrq >= q->unplug_thresh)
>                        __generic_unplug_device(q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7cff5f2..74deb17 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -32,21 +32,51 @@ struct request;
>  struct sg_io_hdr;
>
>  #define BLKDEV_MIN_RQ  4
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +#define BLKDEV_MAX_RQ  512     /* Default maximum for queue */
> +#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
> +#else
>  #define BLKDEV_MAX_RQ  128     /* Default maximum */
> +/*
> + * This is equivalent to the case of only one group present (root group). Let
> + * it consume all the request descriptors available on the queue.
> + */
> +#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
> +#endif
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
>
>  struct request_list {
>        /*
> -        * count[], starved[], and wait[] are indexed by
> +        * count[], starved and wait[] are indexed by
>         * BLK_RW_SYNC/BLK_RW_ASYNC
>         */
>        int count[2];
>        int starved[2];
> +       wait_queue_head_t wait[2];
> +};
> +
> +/*
> + * This data structure keeps track of the mempool of requests for the queue
> + * and some overall statistics.
> + */
> +struct request_data {
> +       /*
> +        * Per queue request descriptor count. This is in addition to per
> +        * cgroup count
> +        */
> +       int count[2];
>        int elvpriv;
>        mempool_t *rq_pool;
> -       wait_queue_head_t wait[2];
> +       int starved;
> +       /*
> +        * Global list for starved tasks. A task will be queued here if
> +        * it could not allocate request descriptor and the associated
> +        * group request list does not have any requests pending.
> +        */
> +       wait_queue_head_t starved_wait;
>  };
>
>  /*
> @@ -339,10 +369,17 @@ struct request_queue
>        struct request          *last_merge;
>        struct elevator_queue   *elevator;
>
> +#ifndef CONFIG_GROUP_IOSCHED
>        /*
>         * the queue request freelist, one for reads and one for writes
> +        * In case of group io scheduling, this request list is per group
> +        * and is present in group data structure.
>         */
>        struct request_list     rq;
> +#endif
> +
> +       /* Contains request pool and other data like starved data */
> +       struct request_data     rq_data;
>
>        request_fn_proc         *request_fn;
>        make_request_fn         *make_request_fn;
> @@ -405,6 +442,8 @@ struct request_queue
>         * queue settings
>         */
>        unsigned long           nr_requests;    /* Max # of requests */
> +       /* Max # of per io group requests */
> +       unsigned long           nr_group_requests;
>        unsigned int            nr_congestion_on;
>        unsigned int            nr_congestion_off;
>        unsigned int            nr_batching;
> @@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>  extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>                         struct scsi_ioctl_command __user *);
>
> +extern void blk_init_request_list(struct request_list *rl);
> +
> +extern struct request_list *blk_get_request_list(struct request_queue *q,
> +                                                       struct bio *bio);
>  /*
>  * A queue has just exitted congestion.  Note this in the global counter of
>  * congested queues, and wake up anyone who was waiting for requests to be
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 9a74b46..af6c9e5 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> @@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7a34cb5..9a03980 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
> @@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
> --
> 1.6.0.6
>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
  2009-08-28 21:31   ` Vivek Goyal
@ 2009-09-14 18:33     ` Nauman Rafique
  -1 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-09-14 18:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf,
	mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, Aug 28, 2009 at 2:31 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> o Currently a request queue has a fixed number of request descriptors for
>  sync and async requests. Once the request descriptors are consumed, new
>  processes are put to sleep and they effectively become serialized. Because
>  sync and async queues are separate, async requests don't impact sync ones
>  but if one is looking for fairness between async requests, that is not
>  achievable if request queue descriptors become bottleneck.
>
> o Make request descriptors per io group so that if there is lots of IO
>  going on in one cgroup, it does not impact the IO of other groups.
>
> o This patch implements the per cgroup request descriptors. request pool per
>  queue is still common but every group will have its own wait list and its
>  own count of request descriptors allocated to that group for sync and async
>  queues. So effectively request_list becomes per io group property and not a
>  global request queue feature.
>
> o Currently one can define q->nr_requests to limit request descriptors
>  allocated for the queue. Now there is another tunable q->nr_group_requests
>  which controls the request descriptor limit per group. q->nr_requests
>  supersedes q->nr_group_requests to make sure that if there are lots of groups
>  present, we don't end up allocating too many request descriptors on the
>  queue.
>
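If I am reading the defaults further down correctly (BLKDEV_MAX_RQ becomes 512
and BLKDEV_MAX_GROUP_RQ 128 when CONFIG_GROUP_IOSCHED is set), a single group
can hold at most 128 descriptors per direction, so roughly four busy groups are
enough to reach the queue-wide nr_requests limit and any further groups then
compete for whatever is left. Both limits stay runtime tunable through sysfs,
e.g. something like "echo 64 > /sys/block/<dev>/queue/nr_group_requests" with
the new attribute added below.
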
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/blk-core.c             |  317 +++++++++++++++++++++++++++++++++---------
>  block/blk-settings.c         |    1 +
>  block/blk-sysfs.c            |   59 ++++++--
>  block/elevator-fq.c          |   36 +++++
>  block/elevator-fq.h          |   29 ++++
>  block/elevator.c             |    7 +-
>  include/linux/blkdev.h       |   47 ++++++-
>  include/trace/events/block.h |    6 +-
>  kernel/trace/blktrace.c      |    6 +-
>  9 files changed, 421 insertions(+), 87 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 47cce59..18b400b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(blk_cleanup_queue);
>
> -static int blk_init_free_list(struct request_queue *q)
> +struct request_list *
> +blk_get_request_list(struct request_queue *q, struct bio *bio)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       /*
> +        * Determine which request list bio will be allocated from. This
> +        * is dependent on which io group bio belongs to
> +        */
> +       return elv_get_request_list_bio(q, bio);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       int priv = rq->cmd_flags & REQ_ELVPRIV;
> +
> +       return elv_get_request_list_rq(q, rq, priv);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +void blk_init_request_list(struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
>
>        rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
> -       rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
> -       rl->elvpriv = 0;
>        init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
>        init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
> +}
>
> -       rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
> -                               mempool_free_slab, request_cachep, q->node);
> +static int blk_init_free_list(struct request_queue *q)
> +{
> +       /*
> +        * In case of group scheduling, request list is inside group and is
> +        * initialized when the group is instantiated.
> +        */
> +#ifndef CONFIG_GROUP_IOSCHED
> +       blk_init_request_list(&q->rq);
> +#endif
> +       q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
> +                               mempool_alloc_slab, mempool_free_slab,
> +                               request_cachep, q->node);
>
> -       if (!rl->rq_pool)
> +       if (!q->rq_data.rq_pool)
>                return -ENOMEM;
>
>        return 0;
> @@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
>        q->queue_flags          = QUEUE_FLAG_DEFAULT;
>        q->queue_lock           = lock;
>
> +       /* init starved waiter wait queue */
> +       init_waitqueue_head(&q->rq_data.starved_wait);
> +
>        /*
>         * This also sets hw/phys segments, boundary and size
>         */
> @@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
>  {
>        if (rq->cmd_flags & REQ_ELVPRIV)
>                elv_put_request(q, rq);
> -       mempool_free(rq, q->rq.rq_pool);
> +       mempool_free(rq, q->rq_data.rq_pool);
>  }
>
>  static struct request *
>  blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>                                        gfp_t gfp_mask)
>  {
> -       struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> +       struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
>
>        if (!rq)
>                return NULL;
> @@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>
>        if (priv) {
>                if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
> -                       mempool_free(rq, q->rq.rq_pool);
> +                       mempool_free(rq, q->rq_data.rq_pool);
>                        return NULL;
>                }
>                rq->cmd_flags |= REQ_ELVPRIV;
> @@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>        ioc->last_waited = jiffies;
>  }
>
> -static void __freed_request(struct request_queue *q, int sync)
> +static void __freed_request(struct request_queue *q, int sync,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> -       if (rl->count[sync] < queue_congestion_off_threshold(q))
> +       if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, sync);
>
> -       if (rl->count[sync] + 1 <= q->nr_requests) {
> +       if (q->rq_data.count[sync] + 1 <= q->nr_requests)
> +               blk_clear_queue_full(q, sync);
> +
> +       if (rl->count[sync] + 1 <= q->nr_group_requests) {
>                if (waitqueue_active(&rl->wait[sync]))
>                        wake_up(&rl->wait[sync]);
> -
> -               blk_clear_queue_full(q, sync);
>        }
>  }
>
> @@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int sync, int priv)
> +static void freed_request(struct request_queue *q, int sync, int priv,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> +       /*
> +        * There is a window during request allocation where request is
> +        * mapped to one group but by the time a queue for the group is
> +        * allocated, it is possible that original cgroup/io group has been
> +        * deleted and now io queue is allocated in a different group (root)
> +        * altogether.
> +        *
> +        * One solution to the problem is that rq should take io group
> +        * reference. But it looks too much to do that to solve this issue.
> +        * The only side effect of this hard to hit issue seems to be that
> +        * we will try to decrement the rl->count for a request list which
> +        * did not allocate that request. Check for rl->count going less than
> +        * zero and do not decrement it if that's the case.
> +        */
> +
> +       if (priv && rl->count[sync] > 0)
> +               rl->count[sync]--;
> +
> +       BUG_ON(!q->rq_data.count[sync]);
> +       q->rq_data.count[sync]--;
>
> -       rl->count[sync]--;
>        if (priv)
> -               rl->elvpriv--;
> +               q->rq_data.elvpriv--;
>
> -       __freed_request(q, sync);
> +       __freed_request(q, sync, rl);
>
>        if (unlikely(rl->starved[sync ^ 1]))
> -               __freed_request(q, sync ^ 1);
> +               __freed_request(q, sync ^ 1, rl);
> +
> +       /* Wake up the starved process on global list, if any */
> +       if (unlikely(q->rq_data.starved)) {
> +               if (waitqueue_active(&q->rq_data.starved_wait))
> +                       wake_up(&q->rq_data.starved_wait);
> +               q->rq_data.starved--;
> +       }
> +}
> +
> +/*
> + * Returns whether one can sleep on this request list or not. There are
> + * cases (elevator switch) where request list might not have allocated
> + * any request descriptor but we deny request allocation due to global
> + * limits. In that case one should sleep on global list as on this request
> + * list no wakeup will take place.
> + *
> + * Also sets the request list starved flag if there are no requests pending
> + * in the direction of rq.
> + *
> + * Return 1 --> sleep on request list, 0 --> sleep on global list
> + */
> +static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
> +{
> +       if (unlikely(rl->count[is_sync] == 0)) {
> +               /*
> +                * If there is a request pending in other direction
> +                * in same io group, then set the starved flag of
> +                * the group request list. Otherwise, we need to
> +                * make this process sleep in global starved list
> +                * to make sure it will not sleep indefinitely.
> +                */
> +               if (rl->count[is_sync ^ 1] != 0) {
> +                       rl->starved[is_sync] = 1;
> +                       return 1;
> +               } else
> +                       return 0;
> +       }
> +
> +       return 1;
>  }
>
>  /*
>  * Get a free request, queue_lock must be held.
> - * Returns NULL on failure, with queue_lock held.
> + * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
> + * in case of failure. This reason field helps the caller decide whether to sleep
> + * on per group list or global per queue list.
> + * reason = 0 sleep on per group list
> + * reason = 1 sleep on global list
> + *
>  * Returns !NULL on success, with queue_lock *not held*.
>  */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
> -                                  struct bio *bio, gfp_t gfp_mask)
> +                                       struct bio *bio, gfp_t gfp_mask,
> +                                       struct request_list *rl, int *reason)
>  {
>        struct request *rq = NULL;
> -       struct request_list *rl = &q->rq;
>        struct io_context *ioc = NULL;
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        int may_queue, priv;
> +       int sleep_on_global = 0;
>
>        may_queue = elv_may_queue(q, rw_flags);
>        if (may_queue == ELV_MQUEUE_NO)
>                goto rq_starved;
>
> -       if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
> -               if (rl->count[is_sync]+1 >= q->nr_requests) {
> -                       ioc = current_io_context(GFP_ATOMIC, q->node);
> -                       /*
> -                        * The queue will fill after this allocation, so set
> -                        * it as full, and mark this process as "batching".
> -                        * This process will be allowed to complete a batch of
> -                        * requests, others will be blocked.
> -                        */
> -                       if (!blk_queue_full(q, is_sync)) {
> -                               ioc_set_batching(q, ioc);
> -                               blk_set_queue_full(q, is_sync);
> -                       } else {
> -                               if (may_queue != ELV_MQUEUE_MUST
> -                                               && !ioc_batching(q, ioc)) {
> -                                       /*
> -                                        * The queue is full and the allocating
> -                                        * process is not a "batcher", and not
> -                                        * exempted by the IO scheduler
> -                                        */
> -                                       goto out;
> -                               }
> +       if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
> +               blk_set_queue_congested(q, is_sync);
> +
> +       /* queue full seems redundant now */
> +       if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
> +               blk_set_queue_full(q, is_sync);
> +
> +       if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +               ioc = current_io_context(GFP_ATOMIC, q->node);
> +               /*
> +                * The queue request descriptor group will fill after this
> +                * allocation, so set it as full, and mark this process as
> +                * "batching". This process will be allowed to complete a
> +                * batch of requests, others will be blocked.
> +                */
> +               if (rl->count[is_sync] <= q->nr_group_requests)
> +                       ioc_set_batching(q, ioc);
> +               else {
> +                       if (may_queue != ELV_MQUEUE_MUST
> +                                       && !ioc_batching(q, ioc)) {
> +                               /*
> +                                * The queue is full and the allocating
> +                                * process is not a "batcher", and not
> +                                * exempted by the IO scheduler
> +                                */
> +                               goto out;
>                        }
>                }
> -               blk_set_queue_congested(q, is_sync);
>        }
>
>        /*
> @@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>         * limit of requests, otherwise we could have thousands of requests
>         * allocated with any setting of ->nr_requests
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +
> +       if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
> +               /*
> +                * Queue is too full for allocation. On which request queue
> +                * should the task sleep? Generally it should sleep on its
> +                * request list but if elevator switch is happening, in that
> +                * window, request descriptors are allocated from global
> +                * pool and are not accounted against any particular request
> +                * list as group is going away.
> +                *
> +                * So it might happen that request list does not have any
> +                * requests allocated at all and if process sleeps on per
> +                * group request list, it will not be woken up. In such case,
> +                * make it sleep on global starved list.
> +                */
> +               if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
> +                   || !can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
> +               goto out;
> +       }
> +
> +       /*
> +        * Allocation of request is allowed from queue perspective. Now check
> +        * from per group request list
> +        */
> +
> +       if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>                goto out;
>
> -       rl->count[is_sync]++;
>        rl->starved[is_sync] = 0;
>
> +       q->rq_data.count[is_sync]++;
> +
>        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -       if (priv)
> -               rl->elvpriv++;
> +       if (priv) {
> +               q->rq_data.elvpriv++;
> +               /*
> +                * Account the request to request list only if request is
> +                * going to elevator. During elevator switch, there will
> +                * be small window where group is going away and new group
> +                * will not be allocated till elevator switch is complete.
> +                * So till then instead of slowing down the application,
> +                * we will continue to allocate request from total common
> +                * pool instead of per group limit
> +                */
> +               rl->count[is_sync]++;
> +       }
>
>        if (blk_queue_io_stat(q))
>                rw_flags |= REQ_IO_STAT;
>        spin_unlock_irq(q->queue_lock);
>
>        rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
> +
>        if (unlikely(!rq)) {
>                /*
>                 * Allocation failed presumably due to memory. Undo anything
> @@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * wait queue, but this is pretty rare.
>                 */
>                spin_lock_irq(q->queue_lock);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>
>                /*
>                 * in the very unlikely event that allocation failed and no
> @@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * rq mempool into READ and WRITE
>                 */
>  rq_starved:
> -               if (unlikely(rl->count[is_sync] == 0))
> -                       rl->starved[is_sync] = 1;
> -
> +               if (!can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
>                goto out;
>        }
>
> @@ -810,6 +951,8 @@ rq_starved:
>
>        trace_block_getrq(q, bio, rw_flags & 1);
>  out:
> +       if (reason && sleep_on_global)
> +               *reason = 1;
>        return rq;
>  }
>
> @@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                                        struct bio *bio)
>  {
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
> +       int sleep_on_global = 0;
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, bio);
>
> -       rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +       rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
>        while (!rq) {
>                DEFINE_WAIT(wait);
>                struct io_context *ioc;
> -               struct request_list *rl = &q->rq;
>
> -               prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> -                               TASK_UNINTERRUPTIBLE);
> +               if (sleep_on_global) {
> +                       /*
> +                        * Task failed allocation and needs to wait and
> +                        * try again. There are no requests pending from
> +                        * the io group hence need to sleep on global
> +                        * wait queue. Most likely the allocation failed
> +                        * because of memory issues.
> +                        */
> +
> +                       q->rq_data.starved++;
> +                       prepare_to_wait_exclusive(&q->rq_data.starved_wait,
> +                                       &wait, TASK_UNINTERRUPTIBLE);
> +               } else {
> +                       /*
> +                        * We are about to sleep on a request list and we
> +                        * drop queue lock. After waking up, we will do
> +                        * finish_wait() on request list and in the mean
> +                        * time group might be gone. Take a reference to
> +                        * the group now.
> +                        */
> +                       prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +                       elv_get_rl_iog(rl);
> +               }
>
>                trace_block_sleeprq(q, bio, rw_flags & 1);
>
> @@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                ioc_set_batching(q, ioc);
>
>                spin_lock_irq(q->queue_lock);
> -               finish_wait(&rl->wait[is_sync], &wait);
>
> -               rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +               if (sleep_on_global) {
> +                       finish_wait(&q->rq_data.starved_wait, &wait);
> +                       sleep_on_global = 0;
> +               } else {
> +                       /*
> +                        * We had taken a reference to the rl/iog. Put that now
> +                        */
> +                       finish_wait(&rl->wait[is_sync], &wait);
> +                       elv_put_rl_iog(rl);
> +               }
> +
> +               /*
> +                * After the sleep, check the rl again in case the cgroup the bio
> +                * belonged to is gone and it is now mapped to the root group
> +                */
> +               rl = blk_get_request_list(q, bio);
> +               rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
> +                                       &sleep_on_global);
>        };
>
>        return rq;
> @@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>        struct request *rq;
> +       struct request_list *rl;
>
>        BUG_ON(rw != READ && rw != WRITE);
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        if (gfp_mask & __GFP_WAIT) {
>                rq = get_request_wait(q, rw, NULL);
>        } else {
> -               rq = get_request(q, rw, NULL, gfp_mask);
> +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
>                if (!rq)
>                        spin_unlock_irq(q->queue_lock);
>        }
> @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>        if (req->cmd_flags & REQ_ALLOCED) {
>                int is_sync = rq_is_sync(req) != 0;
>                int priv = req->cmd_flags & REQ_ELVPRIV;
> +               struct request_list *rl = rq_rl(q, req);
>
>                BUG_ON(!list_empty(&req->queuelist));
>                BUG_ON(!hlist_unhashed(&req->hash));
>
>                blk_free_request(q, req);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);

We have a potential use-after-free here. freed_request should be called
before blk_free_request, as blk_free_request might drop the last reference
to the cgroup and with it free the request_list. Calling freed_request
after blk_free_request would then operate on freed memory.

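Something along these lines (just an untested sketch of the reordering, using
the rl that is already looked up above) would do the request list accounting
while the request_list and its io group are still guaranteed to be alive, and
only then free the request:

	if (req->cmd_flags & REQ_ALLOCED) {
		int is_sync = rq_is_sync(req) != 0;
		int priv = req->cmd_flags & REQ_ELVPRIV;
		struct request_list *rl = rq_rl(q, req);

		BUG_ON(!list_empty(&req->queuelist));
		BUG_ON(!hlist_unhashed(&req->hash));

		/* update counters and wake waiters while rl is still valid */
		freed_request(q, is_sync, priv, rl);
		/* this may drop the last io group reference and free rl */
		blk_free_request(q, req);
	}
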
>        }
>  }
>  EXPORT_SYMBOL_GPL(__blk_put_request);
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 476d870..c3102c7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>         * set defaults
>         */
>        q->nr_requests = BLKDEV_MAX_RQ;
> +       q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>
>        q->make_request_fn = mfn;
>        blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 418d636..f3db7f0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  {
> -       struct request_list *rl = &q->rq;
> +       struct request_list *rl;
>        unsigned long nr;
>        int ret = queue_var_store(&nr, page, count);
>        if (nr < BLKDEV_MIN_RQ)
>                nr = BLKDEV_MIN_RQ;
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        q->nr_requests = nr;
>        blk_queue_congestion_threshold(q);
>
> -       if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_SYNC);
> -       else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_SYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_SYNC);
>
> -       if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_ASYNC);
> -       else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_ASYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_ASYNC);
>
> -       if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_SYNC);
> -       } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_SYNC);
>                wake_up(&rl->wait[BLK_RW_SYNC]);
>        }
>
> -       if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_ASYNC);
> -       } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_ASYNC);
>                wake_up(&rl->wait[BLK_RW_ASYNC]);
>        }
>        spin_unlock_irq(q->queue_lock);
>        return ret;
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +                                       size_t count)
> +{
> +       unsigned long nr;
> +       int ret = queue_var_store(&nr, page, count);
> +
> +       if (nr < BLKDEV_MIN_RQ)
> +               nr = BLKDEV_MIN_RQ;
> +
> +       spin_lock_irq(q->queue_lock);
> +       q->nr_group_requests = nr;
> +       spin_unlock_irq(q->queue_lock);
> +       return ret;
> +}
> +#endif
>
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
>  {
> @@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
>        .store = queue_requests_store,
>  };
>
> +#ifdef CONFIG_GROUP_IOSCHED
> +static struct queue_sysfs_entry queue_group_requests_entry = {
> +       .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_group_requests_show,
> +       .store = queue_group_requests_store,
> +};
> +#endif
> +
>  static struct queue_sysfs_entry queue_ra_entry = {
>        .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>        .show = queue_ra_show,
> @@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> +#ifdef CONFIG_GROUP_IOSCHED
> +       &queue_group_requests_entry.attr,
> +#endif
>        &queue_ra_entry.attr,
>        &queue_max_hw_sectors_entry.attr,
>        &queue_max_sectors_entry.attr,
> @@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
>  {
>        struct request_queue *q =
>                container_of(kobj, struct request_queue, kobj);
> -       struct request_list *rl = &q->rq;
>
>        blk_sync_queue(q);
>
> -       if (rl->rq_pool)
> -               mempool_destroy(rl->rq_pool);
> +       if (q->rq_data.rq_pool)
> +               mempool_destroy(q->rq_data.rq_pool);
>
>        if (q->queue_tags)
>                __blk_queue_free_tags(q);
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9c8783c..39896c2 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -925,6 +925,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
>                            struct io_cgroup, css);
>  }
>
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               iog = q->elevator->efqd->root_group;
> +       else
> +               iog = elv_io_get_io_group_bio(q, bio, 1);
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               return &q->elevator->efqd->root_group->rl;
> +
> +       BUG_ON(priv && !rq->ioq);
> +
> +       if (priv)
> +               iog = ioq_to_io_group(rq->ioq);
> +       else
> +               iog = q->elevator->efqd->root_group;
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
>  /*
>  * Search the io_group for efqd into the hash table (by now only a list)
>  * of bgrp.  Must be called under rcu_read_lock().
> @@ -1281,6 +1314,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
>                elv_get_iog(iog);
>                io_group_path(iog);
>
> +               blk_init_request_list(&iog->rl);
> +
>                if (leaf == NULL) {
>                        leaf = iog;
>                        prev = leaf;
> @@ -1502,6 +1537,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
>        for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>                iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
>
> +       blk_init_request_list(&iog->rl);
>        spin_lock_irq(&iocg->lock);
>        rcu_assign_pointer(iog->key, key);
>        hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index 9fe52fa..989102e 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -128,6 +128,9 @@ struct io_group {
>
>        /* Single ioq per group, used for noop, deadline, anticipatory */
>        struct io_queue *ioq;
> +
> +       /* request list associated with the group */
> +       struct request_list rl;
>  };
>
>  struct io_cgroup {
> @@ -425,11 +428,31 @@ static inline void elv_get_iog(struct io_group *iog)
>        atomic_inc(&iog->ref);
>  }
>
> +static inline struct io_group *rl_iog(struct request_list *rl)
> +{
> +       return container_of(rl, struct io_group, rl);
> +}
> +
> +static inline void elv_get_rl_iog(struct request_list *rl)
> +{
> +       elv_get_iog(rl_iog(rl));
> +}
> +
> +static inline void elv_put_rl_iog(struct request_list *rl)
> +{
> +       elv_put_iog(rl_iog(rl));
> +}
> +
>  extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
>                                        struct bio *bio, gfp_t gfp_mask);
>  extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
>  extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>                                                struct bio *bio);
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
>
>  #else /* !GROUP_IOSCHED */
>
> @@ -469,6 +492,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* GROUP_IOSCHED */
>
>  extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
> @@ -578,6 +604,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* CONFIG_ELV_FAIR_QUEUING */
>  #endif /* _ELV_SCHED_H */
>  #endif /* CONFIG_BLOCK */
> diff --git a/block/elevator.c b/block/elevator.c
> index 4ed37b6..b23db03 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
>         * make sure we don't have any requests in flight
>         */
>        elv_drain_elevator(q);
> -       while (q->rq.elvpriv) {
> +       while (q->rq_data.elvpriv) {
>                __blk_run_queue(q);
>                spin_unlock_irq(q->queue_lock);
>                msleep(10);
> @@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
>        }
>
>        if (unplug_it && blk_queue_plugged(q)) {
> -               int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
> -                               - queue_in_flight(q);
> +               int nrq = q->rq_data.count[BLK_RW_SYNC] +
> +                               q->rq_data.count[BLK_RW_ASYNC] -
> +                               queue_in_flight(q);
>
>                if (nrq >= q->unplug_thresh)
>                        __generic_unplug_device(q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7cff5f2..74deb17 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -32,21 +32,51 @@ struct request;
>  struct sg_io_hdr;
>
>  #define BLKDEV_MIN_RQ  4
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +#define BLKDEV_MAX_RQ  512     /* Default maximum for queue */
> +#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
> +#else
>  #define BLKDEV_MAX_RQ  128     /* Default maximum */
> +/*
> + * This is equivalent to the case of only one group present (root group). Let
> + * it consume all the request descriptors available on the queue.
> + */
> +#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
> +#endif
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
>
>  struct request_list {
>        /*
> -        * count[], starved[], and wait[] are indexed by
> +        * count[], starved and wait[] are indexed by
>         * BLK_RW_SYNC/BLK_RW_ASYNC
>         */
>        int count[2];
>        int starved[2];
> +       wait_queue_head_t wait[2];
> +};
> +
> +/*
> + * This data structure keeps track of the mempool of requests for the queue
> + * and some overall statistics.
> + */
> +struct request_data {
> +       /*
> +        * Per queue request descriptor count. This is in addition to per
> +        * cgroup count
> +        */
> +       int count[2];
>        int elvpriv;
>        mempool_t *rq_pool;
> -       wait_queue_head_t wait[2];
> +       int starved;
> +       /*
> +        * Global list for starved tasks. A task will be queued here if
> +        * it could not allocate request descriptor and the associated
> +        * group request list does not have any requests pending.
> +        */
> +       wait_queue_head_t starved_wait;
>  };
>
>  /*
> @@ -339,10 +369,17 @@ struct request_queue
>        struct request          *last_merge;
>        struct elevator_queue   *elevator;
>
> +#ifndef CONFIG_GROUP_IOSCHED
>        /*
>         * the queue request freelist, one for reads and one for writes
> +        * In case of group io scheduling, this request list is per group
> +        * and is present in group data structure.
>         */
>        struct request_list     rq;
> +#endif
> +
> +       /* Contains request pool and other data like starved data */
> +       struct request_data     rq_data;
>
>        request_fn_proc         *request_fn;
>        make_request_fn         *make_request_fn;
> @@ -405,6 +442,8 @@ struct request_queue
>         * queue settings
>         */
>        unsigned long           nr_requests;    /* Max # of requests */
> +       /* Max # of per io group requests */
> +       unsigned long           nr_group_requests;
>        unsigned int            nr_congestion_on;
>        unsigned int            nr_congestion_off;
>        unsigned int            nr_batching;
> @@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>  extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>                         struct scsi_ioctl_command __user *);
>
> +extern void blk_init_request_list(struct request_list *rl);
> +
> +extern struct request_list *blk_get_request_list(struct request_queue *q,
> +                                                       struct bio *bio);
>  /*
>  * A queue has just exitted congestion.  Note this in the global counter of
>  * congested queues, and wake up anyone who was waiting for requests to be
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 9a74b46..af6c9e5 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> @@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7a34cb5..9a03980 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
> @@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
> --
> 1.6.0.6
>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
@ 2009-09-14 18:33     ` Nauman Rafique
  0 siblings, 0 replies; 322+ messages in thread
From: Nauman Rafique @ 2009-09-14 18:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Aug 28, 2009 at 2:31 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> o Currently a request queue has a fixed number of request descriptors for
>  sync and async requests. Once the request descriptors are consumed, new
>  processes are put to sleep and they effectively become serialized. Because
>  sync and async queues are separate, async requests don't impact sync ones
>  but if one is looking for fairness between async requests, that is not
>  achievable if request queue descriptors become bottleneck.
>
> o Make request descriptors per io group so that if there is lots of IO
>  going on in one cgroup, it does not impact the IO of other groups.
>
> o This patch implements the per cgroup request descriptors. request pool per
>  queue is still common but every group will have its own wait list and its
>  own count of request descriptors allocated to that group for sync and async
>  queues. So effectively request_list becomes per io group property and not a
>  global request queue feature.
>
> o Currently one can define q->nr_requests to limit request descriptors
>  allocated for the queue. Now there is another tunable q->nr_group_requests
>  which controls the request descriptor limit per group. q->nr_requests
>  supersedes q->nr_group_requests to make sure that if there are lots of groups
>  present, we don't end up allocating too many request descriptors on the
>  queue.
>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/blk-core.c             |  317 +++++++++++++++++++++++++++++++++---------
>  block/blk-settings.c         |    1 +
>  block/blk-sysfs.c            |   59 ++++++--
>  block/elevator-fq.c          |   36 +++++
>  block/elevator-fq.h          |   29 ++++
>  block/elevator.c             |    7 +-
>  include/linux/blkdev.h       |   47 ++++++-
>  include/trace/events/block.h |    6 +-
>  kernel/trace/blktrace.c      |    6 +-
>  9 files changed, 421 insertions(+), 87 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 47cce59..18b400b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(blk_cleanup_queue);
>
> -static int blk_init_free_list(struct request_queue *q)
> +struct request_list *
> +blk_get_request_list(struct request_queue *q, struct bio *bio)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       /*
> +        * Determine which request list bio will be allocated from. This
> +        * is dependent on which io group bio belongs to
> +        */
> +       return elv_get_request_list_bio(q, bio);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       int priv = rq->cmd_flags & REQ_ELVPRIV;
> +
> +       return elv_get_request_list_rq(q, rq, priv);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +void blk_init_request_list(struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
>
>        rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
> -       rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
> -       rl->elvpriv = 0;
>        init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
>        init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
> +}
>
> -       rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
> -                               mempool_free_slab, request_cachep, q->node);
> +static int blk_init_free_list(struct request_queue *q)
> +{
> +       /*
> +        * In case of group scheduling, request list is inside group and is
> +        * initialized when the group is instantiated.
> +        */
> +#ifndef CONFIG_GROUP_IOSCHED
> +       blk_init_request_list(&q->rq);
> +#endif
> +       q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
> +                               mempool_alloc_slab, mempool_free_slab,
> +                               request_cachep, q->node);
>
> -       if (!rl->rq_pool)
> +       if (!q->rq_data.rq_pool)
>                return -ENOMEM;
>
>        return 0;
> @@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
>        q->queue_flags          = QUEUE_FLAG_DEFAULT;
>        q->queue_lock           = lock;
>
> +       /* init starved waiter wait queue */
> +       init_waitqueue_head(&q->rq_data.starved_wait);
> +
>        /*
>         * This also sets hw/phys segments, boundary and size
>         */
> @@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
>  {
>        if (rq->cmd_flags & REQ_ELVPRIV)
>                elv_put_request(q, rq);
> -       mempool_free(rq, q->rq.rq_pool);
> +       mempool_free(rq, q->rq_data.rq_pool);
>  }
>
>  static struct request *
>  blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>                                        gfp_t gfp_mask)
>  {
> -       struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> +       struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
>
>        if (!rq)
>                return NULL;
> @@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
>
>        if (priv) {
>                if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
> -                       mempool_free(rq, q->rq.rq_pool);
> +                       mempool_free(rq, q->rq_data.rq_pool);
>                        return NULL;
>                }
>                rq->cmd_flags |= REQ_ELVPRIV;
> @@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>        ioc->last_waited = jiffies;
>  }
>
> -static void __freed_request(struct request_queue *q, int sync)
> +static void __freed_request(struct request_queue *q, int sync,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> -       if (rl->count[sync] < queue_congestion_off_threshold(q))
> +       if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, sync);
>
> -       if (rl->count[sync] + 1 <= q->nr_requests) {
> +       if (q->rq_data.count[sync] + 1 <= q->nr_requests)
> +               blk_clear_queue_full(q, sync);
> +
> +       if (rl->count[sync] + 1 <= q->nr_group_requests) {
>                if (waitqueue_active(&rl->wait[sync]))
>                        wake_up(&rl->wait[sync]);
> -
> -               blk_clear_queue_full(q, sync);
>        }
>  }
>
> @@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int sync, int priv)
> +static void freed_request(struct request_queue *q, int sync, int priv,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> +       /*
> +        * There is a window during request allocation where request is
> +        * mapped to one group but by the time a queue for the group is
> +        * allocated, it is possible that original cgroup/io group has been
> +        * deleted and now io queue is allocated in a different group (root)
> +        * altogether.
> +        *
> +        * One solution to the problem is that rq should take io group
> +        * reference, but that looks like too much to solve this issue.
> +        * The only side effect of this hard-to-hit issue seems to be that
> +        * we will try to decrement the rl->count for a request list which
> +        * did not allocate that request. Check for rl->count going below
> +        * zero and do not decrement it if that's the case.
> +        */
> +
> +       if (priv && rl->count[sync] > 0)
> +               rl->count[sync]--;
> +
> +       BUG_ON(!q->rq_data.count[sync]);
> +       q->rq_data.count[sync]--;
>
> -       rl->count[sync]--;
>        if (priv)
> -               rl->elvpriv--;
> +               q->rq_data.elvpriv--;
>
> -       __freed_request(q, sync);
> +       __freed_request(q, sync, rl);
>
>        if (unlikely(rl->starved[sync ^ 1]))
> -               __freed_request(q, sync ^ 1);
> +               __freed_request(q, sync ^ 1, rl);
> +
> +       /* Wake up the starved process on global list, if any */
> +       if (unlikely(q->rq_data.starved)) {
> +               if (waitqueue_active(&q->rq_data.starved_wait))
> +                       wake_up(&q->rq_data.starved_wait);
> +               q->rq_data.starved--;
> +       }
> +}
> +
> +/*
> + * Returns whether one can sleep on this request list or not. There are
> + * cases (elevator switch) where request list might not have allocated
> + * any request descriptor but we deny request allocation due to global
> + * limits. In that case one should sleep on the global list, as no wakeup
> + * will take place on this request list.
> + *
> + * Also sets the request list starved flag if there are no requests pending
> + * in the direction of rq.
> + *
> + * Return 1 --> sleep on request list, 0 --> sleep on global list
> + */
> +static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
> +{
> +       if (unlikely(rl->count[is_sync] == 0)) {
> +               /*
> +                * If there is a request pending in other direction
> +                * in same io group, then set the starved flag of
> +                * the group request list. Otherwise, we need to
> +                * make this process sleep in global starved list
> +                * to make sure it will not sleep indefinitely.
> +                */
> +               if (rl->count[is_sync ^ 1] != 0) {
> +                       rl->starved[is_sync] = 1;
> +                       return 1;
> +               } else
> +                       return 0;
> +       }
> +
> +       return 1;
>  }
>
>  /*
>  * Get a free request, queue_lock must be held.
> - * Returns NULL on failure, with queue_lock held.
> + * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
> + * in case of failure. This reason field helps the caller decide whether to
> + * sleep on the per-group list or the global per-queue list.
> + * reason = 0 sleep on per group list
> + * reason = 1 sleep on global list
> + *
>  * Returns !NULL on success, with queue_lock *not held*.
>  */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
> -                                  struct bio *bio, gfp_t gfp_mask)
> +                                       struct bio *bio, gfp_t gfp_mask,
> +                                       struct request_list *rl, int *reason)
>  {
>        struct request *rq = NULL;
> -       struct request_list *rl = &q->rq;
>        struct io_context *ioc = NULL;
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        int may_queue, priv;
> +       int sleep_on_global = 0;
>
>        may_queue = elv_may_queue(q, rw_flags);
>        if (may_queue == ELV_MQUEUE_NO)
>                goto rq_starved;
>
> -       if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
> -               if (rl->count[is_sync]+1 >= q->nr_requests) {
> -                       ioc = current_io_context(GFP_ATOMIC, q->node);
> -                       /*
> -                        * The queue will fill after this allocation, so set
> -                        * it as full, and mark this process as "batching".
> -                        * This process will be allowed to complete a batch of
> -                        * requests, others will be blocked.
> -                        */
> -                       if (!blk_queue_full(q, is_sync)) {
> -                               ioc_set_batching(q, ioc);
> -                               blk_set_queue_full(q, is_sync);
> -                       } else {
> -                               if (may_queue != ELV_MQUEUE_MUST
> -                                               && !ioc_batching(q, ioc)) {
> -                                       /*
> -                                        * The queue is full and the allocating
> -                                        * process is not a "batcher", and not
> -                                        * exempted by the IO scheduler
> -                                        */
> -                                       goto out;
> -                               }
> +       if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
> +               blk_set_queue_congested(q, is_sync);
> +
> +       /* queue full seems redundant now */
> +       if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
> +               blk_set_queue_full(q, is_sync);
> +
> +       if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +               ioc = current_io_context(GFP_ATOMIC, q->node);
> +               /*
> +                * The queue request descriptor group will fill after this
> +                * allocation, so set it as full, and mark this process as
> +                * "batching". This process will be allowed to complete a
> +                * batch of requests, others will be blocked.
> +                */
> +               if (rl->count[is_sync] <= q->nr_group_requests)
> +                       ioc_set_batching(q, ioc);
> +               else {
> +                       if (may_queue != ELV_MQUEUE_MUST
> +                                       && !ioc_batching(q, ioc)) {
> +                               /*
> +                                * The queue is full and the allocating
> +                                * process is not a "batcher", and not
> +                                * exempted by the IO scheduler
> +                                */
> +                               goto out;
>                        }
>                }
> -               blk_set_queue_congested(q, is_sync);
>        }
>
>        /*
> @@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>         * limit of requests, otherwise we could have thousands of requests
>         * allocated with any setting of ->nr_requests
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +
> +       if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
> +               /*
> +                * Queue is too full for allocation. On which request list
> +                * should the task sleep? Generally it should sleep on its
> +                * request list but if elevator switch is happening, in that
> +                * window, request descriptors are allocated from global
> +                * pool and are not accounted against any particular request
> +                * list as group is going away.
> +                *
> +                * So it might happen that request list does not have any
> +                * requests allocated at all and if process sleeps on per
> +                * group request list, it will not be woken up. In such case,
> +                * make it sleep on global starved list.
> +                */
> +               if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
> +                   || !can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
> +               goto out;
> +       }
> +
> +       /*
> +        * Allocation of request is allowed from queue perspective. Now check
> +        * from per group request list
> +        */
> +
> +       if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>                goto out;
>
> -       rl->count[is_sync]++;
>        rl->starved[is_sync] = 0;
>
> +       q->rq_data.count[is_sync]++;
> +
>        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -       if (priv)
> -               rl->elvpriv++;
> +       if (priv) {
> +               q->rq_data.elvpriv++;
> +               /*
> +                * Account the request to request list only if request is
> +                * going to the elevator. During an elevator switch, there is
> +                * a small window where the group is going away and a new group
> +                * will not be allocated till the elevator switch is complete.
> +                * So till then, instead of slowing down the application,
> +                * we will continue to allocate requests from the total common
> +                * pool instead of against the per-group limit.
> +                */
> +               rl->count[is_sync]++;
> +       }
>
>        if (blk_queue_io_stat(q))
>                rw_flags |= REQ_IO_STAT;
>        spin_unlock_irq(q->queue_lock);
>
>        rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
> +
>        if (unlikely(!rq)) {
>                /*
>                 * Allocation failed presumably due to memory. Undo anything
> @@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * wait queue, but this is pretty rare.
>                 */
>                spin_lock_irq(q->queue_lock);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>
>                /*
>                 * in the very unlikely event that allocation failed and no
> @@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>                 * rq mempool into READ and WRITE
>                 */
>  rq_starved:
> -               if (unlikely(rl->count[is_sync] == 0))
> -                       rl->starved[is_sync] = 1;
> -
> +               if (!can_sleep_on_request_list(rl, is_sync))
> +                       sleep_on_global = 1;
>                goto out;
>        }
>
> @@ -810,6 +951,8 @@ rq_starved:
>
>        trace_block_getrq(q, bio, rw_flags & 1);
>  out:
> +       if (reason && sleep_on_global)
> +               *reason = 1;
>        return rq;
>  }
>
> @@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                                        struct bio *bio)
>  {
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
> +       int sleep_on_global = 0;
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, bio);
>
> -       rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +       rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
>        while (!rq) {
>                DEFINE_WAIT(wait);
>                struct io_context *ioc;
> -               struct request_list *rl = &q->rq;
>
> -               prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> -                               TASK_UNINTERRUPTIBLE);
> +               if (sleep_on_global) {
> +                       /*
> +                        * Task failed allocation and needs to wait and
> +                        * try again. There are no requests pending from
> +                        * the io group, hence it needs to sleep on the global
> +                        * wait queue. Most likely the allocation failed
> +                        * because of memory issues.
> +                        */
> +
> +                       q->rq_data.starved++;
> +                       prepare_to_wait_exclusive(&q->rq_data.starved_wait,
> +                                       &wait, TASK_UNINTERRUPTIBLE);
> +               } else {
> +                       /*
> +                        * We are about to sleep on a request list and we
> +                        * drop queue lock. After waking up, we will do
> +                        * finish_wait() on request list and in the mean
> +                        * time group might be gone. Take a reference to
> +                        * the group now.
> +                        */
> +                       prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +                       elv_get_rl_iog(rl);
> +               }
>
>                trace_block_sleeprq(q, bio, rw_flags & 1);
>
> @@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>                ioc_set_batching(q, ioc);
>
>                spin_lock_irq(q->queue_lock);
> -               finish_wait(&rl->wait[is_sync], &wait);
>
> -               rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +               if (sleep_on_global) {
> +                       finish_wait(&q->rq_data.starved_wait, &wait);
> +                       sleep_on_global = 0;
> +               } else {
> +                       /*
> +                        * We had taken a reference to the rl/iog. Put that now
> +                        */
> +                       finish_wait(&rl->wait[is_sync], &wait);
> +                       elv_put_rl_iog(rl);
> +               }
> +
> +               /*
> +                * After the sleep, check the rl again in case the cgroup the
> +                * bio belonged to is gone and it is mapped to the root group now.
> +                */
> +               rl = blk_get_request_list(q, bio);
> +               rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
> +                                       &sleep_on_global);
>        };
>
>        return rq;
> @@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>        struct request *rq;
> +       struct request_list *rl;
>
>        BUG_ON(rw != READ && rw != WRITE);
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        if (gfp_mask & __GFP_WAIT) {
>                rq = get_request_wait(q, rw, NULL);
>        } else {
> -               rq = get_request(q, rw, NULL, gfp_mask);
> +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
>                if (!rq)
>                        spin_unlock_irq(q->queue_lock);
>        }
> @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>        if (req->cmd_flags & REQ_ALLOCED) {
>                int is_sync = rq_is_sync(req) != 0;
>                int priv = req->cmd_flags & REQ_ELVPRIV;
> +               struct request_list *rl = rq_rl(q, req);
>
>                BUG_ON(!list_empty(&req->queuelist));
>                BUG_ON(!hlist_unhashed(&req->hash));
>
>                blk_free_request(q, req);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);

We have a potential use-after-free here: freed_request() should be called
before blk_free_request(), as blk_free_request() might result in release of
the cgroup and, with it, the request_list. Calling freed_request() after
blk_free_request() would then operate on freed memory.
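
Roughly something like this, i.e. do the request_list accounting while rl is
still guaranteed to be alive, and only then free the request (just a sketch of
the suggested reordering against this hunk, not a tested patch):

	if (req->cmd_flags & REQ_ALLOCED) {
		int is_sync = rq_is_sync(req) != 0;
		int priv = req->cmd_flags & REQ_ELVPRIV;
		struct request_list *rl = rq_rl(q, req);

		BUG_ON(!list_empty(&req->queuelist));
		BUG_ON(!hlist_unhashed(&req->hash));

		/*
		 * Account the freed request first, while the io group that
		 * embeds rl is still pinned by the request. blk_free_request()
		 * (via elv_put_request()) may drop the last reference to the
		 * group and free rl along with it.
		 */
		freed_request(q, is_sync, priv, rl);
		blk_free_request(q, req);
	}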

>        }
>  }
>  EXPORT_SYMBOL_GPL(__blk_put_request);
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 476d870..c3102c7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>         * set defaults
>         */
>        q->nr_requests = BLKDEV_MAX_RQ;
> +       q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>
>        q->make_request_fn = mfn;
>        blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 418d636..f3db7f0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  {
> -       struct request_list *rl = &q->rq;
> +       struct request_list *rl;
>        unsigned long nr;
>        int ret = queue_var_store(&nr, page, count);
>        if (nr < BLKDEV_MIN_RQ)
>                nr = BLKDEV_MIN_RQ;
>
>        spin_lock_irq(q->queue_lock);
> +       rl = blk_get_request_list(q, NULL);
>        q->nr_requests = nr;
>        blk_queue_congestion_threshold(q);
>
> -       if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_SYNC);
> -       else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_SYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_SYNC);
>
> -       if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_ASYNC);
> -       else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_ASYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_ASYNC);
>
> -       if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_SYNC);
> -       } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_SYNC);
>                wake_up(&rl->wait[BLK_RW_SYNC]);
>        }
>
> -       if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_ASYNC);
> -       } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_ASYNC);
>                wake_up(&rl->wait[BLK_RW_ASYNC]);
>        }
>        spin_unlock_irq(q->queue_lock);
>        return ret;
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +                                       size_t count)
> +{
> +       unsigned long nr;
> +       int ret = queue_var_store(&nr, page, count);
> +
> +       if (nr < BLKDEV_MIN_RQ)
> +               nr = BLKDEV_MIN_RQ;
> +
> +       spin_lock_irq(q->queue_lock);
> +       q->nr_group_requests = nr;
> +       spin_unlock_irq(q->queue_lock);
> +       return ret;
> +}
> +#endif
>
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
>  {
> @@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
>        .store = queue_requests_store,
>  };
>
> +#ifdef CONFIG_GROUP_IOSCHED
> +static struct queue_sysfs_entry queue_group_requests_entry = {
> +       .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_group_requests_show,
> +       .store = queue_group_requests_store,
> +};
> +#endif
> +
>  static struct queue_sysfs_entry queue_ra_entry = {
>        .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>        .show = queue_ra_show,
> @@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> +#ifdef CONFIG_GROUP_IOSCHED
> +       &queue_group_requests_entry.attr,
> +#endif
>        &queue_ra_entry.attr,
>        &queue_max_hw_sectors_entry.attr,
>        &queue_max_sectors_entry.attr,
> @@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
>  {
>        struct request_queue *q =
>                container_of(kobj, struct request_queue, kobj);
> -       struct request_list *rl = &q->rq;
>
>        blk_sync_queue(q);
>
> -       if (rl->rq_pool)
> -               mempool_destroy(rl->rq_pool);
> +       if (q->rq_data.rq_pool)
> +               mempool_destroy(q->rq_data.rq_pool);
>
>        if (q->queue_tags)
>                __blk_queue_free_tags(q);
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9c8783c..39896c2 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -925,6 +925,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
>                            struct io_cgroup, css);
>  }
>
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               iog = q->elevator->efqd->root_group;
> +       else
> +               iog = elv_io_get_io_group_bio(q, bio, 1);
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
> +{
> +       struct io_group *iog;
> +
> +       if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +               return &q->elevator->efqd->root_group->rl;
> +
> +       BUG_ON(priv && !rq->ioq);
> +
> +       if (priv)
> +               iog = ioq_to_io_group(rq->ioq);
> +       else
> +               iog = q->elevator->efqd->root_group;
> +
> +       BUG_ON(!iog);
> +       return &iog->rl;
> +}
> +
>  /*
>  * Search the io_group for efqd into the hash table (by now only a list)
>  * of bgrp.  Must be called under rcu_read_lock().
> @@ -1281,6 +1314,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
>                elv_get_iog(iog);
>                io_group_path(iog);
>
> +               blk_init_request_list(&iog->rl);
> +
>                if (leaf == NULL) {
>                        leaf = iog;
>                        prev = leaf;
> @@ -1502,6 +1537,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
>        for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>                iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
>
> +       blk_init_request_list(&iog->rl);
>        spin_lock_irq(&iocg->lock);
>        rcu_assign_pointer(iog->key, key);
>        hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index 9fe52fa..989102e 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -128,6 +128,9 @@ struct io_group {
>
>        /* Single ioq per group, used for noop, deadline, anticipatory */
>        struct io_queue *ioq;
> +
> +       /* request list associated with the group */
> +       struct request_list rl;
>  };
>
>  struct io_cgroup {
> @@ -425,11 +428,31 @@ static inline void elv_get_iog(struct io_group *iog)
>        atomic_inc(&iog->ref);
>  }
>
> +static inline struct io_group *rl_iog(struct request_list *rl)
> +{
> +       return container_of(rl, struct io_group, rl);
> +}
> +
> +static inline void elv_get_rl_iog(struct request_list *rl)
> +{
> +       elv_get_iog(rl_iog(rl));
> +}
> +
> +static inline void elv_put_rl_iog(struct request_list *rl)
> +{
> +       elv_put_iog(rl_iog(rl));
> +}
> +
>  extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
>                                        struct bio *bio, gfp_t gfp_mask);
>  extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
>  extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>                                                struct bio *bio);
> +struct request_list *
> +elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
> +
> +struct request_list *
> +elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
>
>  #else /* !GROUP_IOSCHED */
>
> @@ -469,6 +492,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* GROUP_IOSCHED */
>
>  extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
> @@ -578,6 +604,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>        return NULL;
>  }
>
> +static inline void elv_get_rl_iog(struct request_list *rl) { }
> +static inline void elv_put_rl_iog(struct request_list *rl) { }
> +
>  #endif /* CONFIG_ELV_FAIR_QUEUING */
>  #endif /* _ELV_SCHED_H */
>  #endif /* CONFIG_BLOCK */
> diff --git a/block/elevator.c b/block/elevator.c
> index 4ed37b6..b23db03 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
>         * make sure we don't have any requests in flight
>         */
>        elv_drain_elevator(q);
> -       while (q->rq.elvpriv) {
> +       while (q->rq_data.elvpriv) {
>                __blk_run_queue(q);
>                spin_unlock_irq(q->queue_lock);
>                msleep(10);
> @@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
>        }
>
>        if (unplug_it && blk_queue_plugged(q)) {
> -               int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
> -                               - queue_in_flight(q);
> +               int nrq = q->rq_data.count[BLK_RW_SYNC] +
> +                               q->rq_data.count[BLK_RW_ASYNC] -
> +                               queue_in_flight(q);
>
>                if (nrq >= q->unplug_thresh)
>                        __generic_unplug_device(q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7cff5f2..74deb17 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -32,21 +32,51 @@ struct request;
>  struct sg_io_hdr;
>
>  #define BLKDEV_MIN_RQ  4
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +#define BLKDEV_MAX_RQ  512     /* Default maximum for queue */
> +#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
> +#else
>  #define BLKDEV_MAX_RQ  128     /* Default maximum */
> +/*
> + * This is equivalent to the case of only one group present (root group). Let
> + * it consume all the request descriptors available on the queue.
> + */
> +#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
> +#endif
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
>
>  struct request_list {
>        /*
> -        * count[], starved[], and wait[] are indexed by
> +        * count[], starved and wait[] are indexed by
>         * BLK_RW_SYNC/BLK_RW_ASYNC
>         */
>        int count[2];
>        int starved[2];
> +       wait_queue_head_t wait[2];
> +};
> +
> +/*
> + * This data structure keeps track of the mempool of requests for the queue
> + * and some overall statistics.
> + */
> +struct request_data {
> +       /*
> +        * Per queue request descriptor count. This is in addition to per
> +        * cgroup count
> +        */
> +       int count[2];
>        int elvpriv;
>        mempool_t *rq_pool;
> -       wait_queue_head_t wait[2];
> +       int starved;
> +       /*
> +        * Global list for starved tasks. A task will be queued here if
> +        * it could not allocate request descriptor and the associated
> +        * group request list does not have any requests pending.
> +        */
> +       wait_queue_head_t starved_wait;
>  };
>
>  /*
> @@ -339,10 +369,17 @@ struct request_queue
>        struct request          *last_merge;
>        struct elevator_queue   *elevator;
>
> +#ifndef CONFIG_GROUP_IOSCHED
>        /*
>         * the queue request freelist, one for reads and one for writes
> +        * In case of group io scheduling, this request list is per group
> +        * and is present in group data structure.
>         */
>        struct request_list     rq;
> +#endif
> +
> +       /* Contains request pool and other data like starved data */
> +       struct request_data     rq_data;
>
>        request_fn_proc         *request_fn;
>        make_request_fn         *make_request_fn;
> @@ -405,6 +442,8 @@ struct request_queue
>         * queue settings
>         */
>        unsigned long           nr_requests;    /* Max # of requests */
> +       /* Max # of per io group requests */
> +       unsigned long           nr_group_requests;
>        unsigned int            nr_congestion_on;
>        unsigned int            nr_congestion_off;
>        unsigned int            nr_batching;
> @@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>  extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>                         struct scsi_ioctl_command __user *);
>
> +extern void blk_init_request_list(struct request_list *rl);
> +
> +extern struct request_list *blk_get_request_list(struct request_queue *q,
> +                                                       struct bio *bio);
>  /*
>  * A queue has just exitted congestion.  Note this in the global counter of
>  * congested queues, and wake up anyone who was waiting for requests to be
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 9a74b46..af6c9e5 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> @@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
>        ),
>
>        TP_fast_assign(
> -               __entry->nr_rq  = q->rq.count[READ] + q->rq.count[WRITE];
> +               __entry->nr_rq  = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>        ),
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7a34cb5..9a03980 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
> @@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
>        struct blk_trace *bt = q->blk_trace;
>
>        if (bt) {
> -               unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
> +               unsigned int pdu = q->rq_data.count[READ] +
> +                                       q->rq_data.count[WRITE];
>                __be64 rpdu = cpu_to_be64(pdu);
>
>                __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
> --
> 1.6.0.6
>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]           ` <4AA9A4BE.30005-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-09-14  2:44             ` Vivek Goyal
@ 2009-09-15  3:37             ` Vivek Goyal
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-15  3:37 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> >>>> Hi Vivek,
> >>>>
> >>>> I happened to encounter a bug when I was testing IO Controller V9.
> >>>> When there are three tasks running concurrently in three groups,
> >>>> that is, one in a parent group and the other two tasks running
> >>>> in two different child groups respectively to read or write
> >>>> files on some disk, say disk "hdb", a task may hang up, and
> >>>> other tasks which access "hdb" will also hang up.
> >>>>
> >>>> The bug only happens when using the AS io scheduler.
> >>>> The following script can reproduce this bug in my box.
> >>>>
> >>> Hi Gui,
> >>>
> >>> I tried reproducing this on my system and can't reproduce it. All the
> >>> three processes get killed and system does not hang.
> >>>
> >>> Can you please dig deeper a bit into it. 
> >>>
> >>> - If whole system hangs or it is just IO to disk seems to be hung.
> >>     Only when the task is trying to do IO to the disk does it hang up.
> >>
> >>> - Does io scheduler switch on the device work
> >>     yes, io scheduler can be switched, and the hung task will be resumed.
> >>
> >>> - If the system is not hung, can you capture the blktrace on the device.
> >>>   Trace might give some idea, what's happening.
> >> I ran a "find" task to do some io on that disk; it seems that the task hangs
> >> when it is issuing the getdents() syscall.
> >> kernel generates the following message:
> >>
> >> INFO: task find:3260 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> find          D a1e95787  1912  3260   2897 0x00000004
> >>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
> >>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
> >>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
> >> Call Trace:
> >>  [<c0447323>] ? getnstimeofday+0x57/0xe0
> >>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
> >>  [<c068ab68>] io_schedule+0x47/0x79
> >>  [<c04c12ee>] sync_buffer+0x36/0x3a
> >>  [<c068ae14>] __wait_on_bit+0x36/0x5d
> >>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
> >>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
> >>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
> >>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
> >>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
> >>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
> >>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
> >>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
> >>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
> >>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
> >>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
> >>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
> >>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
> >>  [<c04b1100>] ? filldir64+0x0/0xcd
> >>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
> >>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
> >>  [<c04b12db>] ? vfs_readdir+0x46/0x94
> >>  [<c04b12fd>] vfs_readdir+0x68/0x94
> >>  [<c04b1100>] ? filldir64+0x0/0xcd
> >>  [<c04b1387>] sys_getdents64+0x5e/0x9f
> >>  [<c04028b4>] sysenter_do_call+0x12/0x32
> >> 1 lock held by find/3260:
> >>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
> >>
> >> ext3 calls wait_on_buffer() to wait on the buffer and schedules the task out in TASK_UNINTERRUPTIBLE
> >> state, and I found this task will only be resumed after quite a long period (more than 10 mins).
> > 
> > Thanks Gui. As Jens said, it does look like a case of missing queue
> > restart somewhere and now we are stuck: no requests are being dispatched
> > to the disk and the queue is already unplugged.
> > 
> > Can you please also try capturing the trace of events at io scheduler
> > (blktrace) to see how we got into that situation.
> > 
> > Are you using ide drivers and not libata? As Jens said, I will try to make
> > use of ide drivers and see if I can reproduce it.
> > 
> 
> Hi Vivek, Jens,
> 
> Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
> optimize by not expiring the current ioq, on the assumption that the current ioq belongs to the root group.
> But in some cases this assumption is not true. Consider the following scenario: there is a child cgroup
> located in the root cgroup, task A is running in the child cgroup, and task A issues some IOs. Then we
> kill task A and remove the child cgroup; at this time only the root cgroup is available. But the ioq
> is still under service, and from now on this ioq won't expire because of the "only root" optimization.
> The following patch ensures the ioq does belong to the root group if only the root group exists.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

Hi Gui,

I have modified your patch a bit to improve readability. Looking at the
issue closely I realized that this optimization of not expiring the 
queue can lead to other issues like high vdisktime in certain scenarios.
While fixing that, I also noticed a high rate of AS queue expirations in
certain cases which could have been avoided.
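
To make the vdisktime issue concrete (numbers made up purely for illustration):
say the allocated slice of the AS queue is 100ms, but thanks to the "only root
group, no expiry" optimization the queue keeps running unexpired for, say, 60
seconds before a new group shows up. On expiry the old code would charge the
whole ~60000ms against the entity's vdisktime, pushing the queue far into the
future, whereas with the charge_one_slice logic below the virtual charge is
capped at the 100ms slice (the real time and sectors are still accounted in
the stats).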

Here is a patch which should fix all that. I am still testing this patch
to make sure that nothing is obviously broken. Will merge it if
there are no issues.

Thanks
Vivek

o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
  and fixed by Gui.

o If an AS queue is not expired for a long time and suddenly somebody
  decides to create a group and launch a job there, the old AS
  queue will be expired with a very high value of slice used and will get
  a very high disk time. Fix it by marking the queue as "charge_one_slice"
  and charging the queue for only a single time slice and not for the whole
  duration the queue was running.

o There are cases where, with AS, excessive queue expiration will take
  place in the elevator fair queuing layer for a few reasons:
	- AS does not anticipate on a queue if there are no competing requests.
	  So if only a single reader is present in a group, anticipation does
	  not get turned on.

	- elevator layer does not know that AS is anticipating, hence it initiates
	  expiry requests in select_ioq() thinking the queue is empty.

	- elevator layer tries to aggressively expire the last empty queue. This
	  can lead to a lot of queue expiries.

o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in the
  queue has completed and the associated io context is eligible to anticipate.
  Also, AS lets the elevator layer know that it is anticipating
  (elv_ioq_wait_request()). This solves the above-mentioned issues.
 
o Moved some of the code in separate functions to improve readability.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c  |   93 +++++++++++++++++++++++++++++---
 block/elevator-fq.c |  150 +++++++++++++++++++++++++++++++++++++++++-----------
 block/elevator-fq.h |    3 +
 3 files changed, 210 insertions(+), 36 deletions(-)

Index: linux18/block/elevator-fq.c
===================================================================
--- linux18.orig/block/elevator-fq.c	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/elevator-fq.c	2009-09-14 23:09:08.000000000 -0400
@@ -472,19 +472,18 @@ static inline void debug_entity_vdisktim
 					unsigned long served, u64 delta) {}
 #endif /* DEBUG_ELV_FAIR_QUEUING */
 
-static void
-entity_served(struct io_entity *entity, unsigned long served,
-				unsigned long nr_sectors)
+static void entity_served(struct io_entity *entity, unsigned long real_served,
+		unsigned long virtual_served, unsigned long nr_sectors)
 {
 	for_each_entity(entity) {
 		u64 delta;
 
-		delta = elv_delta_fair(served, entity);
+		delta = elv_delta_fair(virtual_served, entity);
 		entity->vdisktime += delta;
 		update_min_vdisktime(entity->st);
-		entity->total_time += served;
+		entity->total_time += real_served;
 		entity->total_sectors += nr_sectors;
-		debug_entity_vdisktime(entity, served, delta);
+		debug_entity_vdisktime(entity, virtual_served, delta);
 	}
 }
 
@@ -928,7 +927,24 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
-	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+	unsigned long virtual_served = served, allocated_slice;
+
+	/*
+	 * For single ioq schedulers we don't expire the queue if there are
+	 * no other competing groups. It might happen that once a queue has
+	 * not been expired for a long time, suddenly a new group is created
+	 * and IO comes in that new group. In that case, we don't want to
+	 * charge the old queue for the whole period it was not expired.
+	 */
+	if (elv_ioq_charge_one_slice(ioq)) {
+		allocated_slice = elv_prio_to_slice(ioq->efqd, ioq);
+		if (served > allocated_slice)
+			virtual_served = allocated_slice;
+		elv_clear_ioq_charge_one_slice(ioq);
+	}
+
+	entity_served(&ioq->entity, served, virtual_served, ioq->nr_sectors);
 	elv_log_ioq(ioq->efqd, ioq, "ioq served: QSt=%lu QSs=%lu qued=%lu",
 			served, ioq->nr_sectors, ioq->nr_queued);
 	print_ioq_service_stats(ioq);
@@ -2543,6 +2559,22 @@ alloc_sched_q:
 		elv_init_ioq_io_group(ioq, iog);
 		elv_init_ioq_sched_queue(e, ioq, sched_q);
 
+		/*
+		 * For AS, also mark the group queue idle_window. This will
+		 * make sure that select_ioq() will not try to expire an
+		 * AS queue if there are dispatched requests from the queue but
+		 * queue is empty. This gives a chance to asq to anticipate
+		 * after request completion, otherwise select_ioq() will
+		 * mark it must_expire and soon asq will be expired.
+		 *
+		 *  Not doing it for noop and deadline yet as they don't have
+		 *  any anticipation logic and this will slow down queue
+		 *  switching on NCQ-supporting hardware.
+		 */
+		if (!strcmp(e->elevator_type->elevator_name, "anticipatory")) {
+			elv_mark_ioq_idle_window(ioq);
+		}
+
 		elv_io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
 		elv_get_iog(iog);
@@ -2664,6 +2696,12 @@ static inline int is_only_root_group(voi
 
 #endif /* CONFIG_GROUP_IOSCHED */
 
+static inline int ioq_is_idling(struct io_queue *ioq)
+{
+	return (elv_ioq_wait_request(ioq) ||
+			timer_pending(&ioq->efqd->idle_slice_timer));
+}
+
 /*
  * Should be called after ioq prio and class has been initialized as prio
  * class data will be used to determine which service tree in the group
@@ -2835,7 +2873,6 @@ elv_iosched_expire_ioq(struct request_qu
 		if (!ret)
 			elv_mark_ioq_must_expire(ioq);
 	}
-
 	return ret;
 }
 
@@ -3078,6 +3115,7 @@ void elv_ioq_request_add(struct request_
 		 */
 		if (group_wait_req || elv_ioq_wait_request(ioq)) {
 			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_request(ioq);
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
 				__blk_run_queue(q);
@@ -3121,6 +3159,7 @@ static void elv_idle_slice_timer(unsigne
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
 
+		elv_clear_ioq_wait_request(ioq);
 		elv_clear_iog_wait_request(iog);
 
 		if (elv_iog_wait_busy(iog)) {
@@ -3222,6 +3261,28 @@ static inline struct io_queue *elv_close
 	return new_ioq;
 }
 
+/*
+ * One can do some optimizations for single ioq scheduler, when one does
+ * not have to expire the queue after every time slice is used. This avoids
+ * some unnecessary overhead, especially in AS where we wait for requests to
+ * finish from last queue before new queue is scheduled in.
+ */
+static inline int single_ioq_no_timed_expiry(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_single_ioq(q->elevator))
+		return 0;
+
+	if (!is_only_root_group())
+		return 0;
+
+	if (efqd->busy_queues == 1)
+		return 1;
+
+	return 0;
+}
+
 /* Common layer function to select the next queue to dispatch from */
 void *elv_select_ioq(struct request_queue *q, int force)
 {
@@ -3229,7 +3290,7 @@ void *elv_select_ioq(struct request_queu
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
  	struct elevator_type *e = q->elevator->elevator_type;
- 	int slice_expired = 1;
+ 	int slice_expired = 0;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -3255,16 +3316,20 @@ void *elv_select_ioq(struct request_queu
 	}
 
 	/* This queue has been marked for expiry. Try to expire it */
-	if (elv_ioq_must_expire(ioq))
+	if (elv_ioq_must_expire(ioq)) {
+		elv_log_ioq(efqd, ioq, "select: ioq must_expire. expire");
 		goto expire;
+	}
 
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS).
 	 */
 
-	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+	if (single_ioq_no_timed_expiry(q)) {
+		elv_mark_ioq_charge_one_slice(ioq);
 		goto keep_queue;
+	}
 
 	/* We are waiting for this group to become busy before it expires.*/
 	if (elv_iog_wait_busy(iog)) {
@@ -3301,6 +3366,7 @@ void *elv_select_ioq(struct request_queu
 		 * from queue and is not proportional to group's weight, it
 		 * harms the fairness of the group.
 		 */
+		slice_expired = 1;
 		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
 			ioq = NULL;
 			goto keep_queue;
@@ -3332,7 +3398,7 @@ void *elv_select_ioq(struct request_queu
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
 
-	if (timer_pending(&efqd->idle_slice_timer) ||
+	if (ioq_is_idling(ioq) ||
 	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
 		ioq = NULL;
 		goto keep_queue;
@@ -3344,7 +3410,6 @@ void *elv_select_ioq(struct request_queu
 		goto keep_queue;
 	}
 
-	slice_expired = 0;
 expire:
  	if (efqd->fairness && !force && ioq && ioq->dispatched
  	    && strcmp(e->elevator_name, "anticipatory")) {
@@ -3439,6 +3504,43 @@ void elv_deactivate_rq_fair(struct reque
 						efqd->rq_in_driver);
 }
 
+/*
+ * if this is the only queue and it has completed all its requests and has nothing
+ * to dispatch, expire it. We don't want to keep it around idle otherwise later
+ * when it is expired, all this idle time will be added to queue's disk time
+ * used and queue might not get a chance to run for a long time.
+ */
+static inline void
+check_expire_last_empty_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (efqd->busy_queues != 1)
+		return;
+
+	if (ioq->dispatched || ioq->nr_queued)
+		return;
+
+	/*
+	 * Anticipation is on. Don't expire queue. Either a new request will
+	 * come or it is up to io scheduler to expire the queue once idle
+	 * timer fires
+	 */
+
+	if (ioq_is_idling(ioq))
+		return;
+
+	/*
+	 * If IO scheduler denies expiration here, it is up to io scheduler
+	 * to expire the queue when possible. Otherwise all the idle time
+	 * will be charged to the queue when queue finally expires.
+	 */
+	if (elv_iosched_expire_ioq(q, 0, 0)) {
+		elv_log_ioq(efqd, ioq, "expire last empty queue");
+		elv_slice_expired(q);
+	}
+}
+
 /* A request got completed from io_queue. Do the accounting. */
 void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 {
@@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
 			elv_set_prio_slice(q->elevator->efqd, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
+
 		/*
 		 * If there is only root group present, don't expire the queue
 		 * for single queue ioschedulers (noop, deadline, AS). It is
 		 * unnecessary overhead.
 		 */
 
-		if (is_only_root_group() &&
-			elv_iosched_single_ioq(q->elevator)) {
-			elv_log_ioq(efqd, ioq, "select: only root group,"
-					" no expiry");
+		if (single_ioq_no_timed_expiry(q)) {
+			elv_mark_ioq_charge_one_slice(ioq);
+			elv_log_ioq(efqd, ioq, "single ioq no timed expiry");
 			goto done;
 		}
 
@@ -3519,7 +3621,7 @@ void elv_ioq_completed_request(struct re
 		 * decide to idle on queue, idle on group.
 		 */
 		if (elv_iog_should_idle(ioq) && !elv_ioq_nr_dispatched(ioq)
-		    && !timer_pending(&efqd->idle_slice_timer)) {
+		    && !ioq_is_idling(ioq)) {
 			/*
 			 * If queue has used up its slice, wait for the
 			 * one extra group_idle period to let the group
@@ -3532,17 +3634,7 @@ void elv_ioq_completed_request(struct re
 				elv_iog_arm_slice_timer(q, iog, 0);
 		}
 
-		/*
-		 * if this is only queue and it has completed all its requests
-		 * and has nothing to dispatch, expire it. We don't want to
-		 * keep it around idle otherwise later when it is expired, all
-		 * this idle time will be added to queue's disk time used.
-		 */
-		if (efqd->busy_queues == 1 && !ioq->dispatched &&
-		   !ioq->nr_queued && !timer_pending(&efqd->idle_slice_timer)) {
-			if (elv_iosched_expire_ioq(q, 0, 0))
-				elv_slice_expired(q);
-		}
+		check_expire_last_empty_queue(q, ioq);
 	}
 done:
 	if (!efqd->rq_in_driver)
Index: linux18/block/as-iosched.c
===================================================================
--- linux18.orig/block/as-iosched.c	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/as-iosched.c	2009-09-14 23:13:08.000000000 -0400
@@ -187,6 +187,24 @@ static void as_antic_stop(struct as_data
 static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
 
 #ifdef CONFIG_IOSCHED_AS_HIER
+static int as_can_anticipate(struct as_data *ad, struct request *rq);
+static void as_antic_waitnext(struct as_data *ad);
+
+static inline void as_mark_active_asq_wait_request(struct as_data *ad)
+{
+	struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+	elv_mark_ioq_wait_request(asq->ioq);
+}
+
+static inline void as_clear_active_asq_wait_request(struct as_data *ad)
+{
+	struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+	if (asq)
+		elv_clear_ioq_wait_request(asq->ioq);
+}
+
 static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
 {
 	/* Save batch data dir */
@@ -279,6 +297,29 @@ static void as_active_ioq_set(struct req
 }
 
 /*
+ * AS does not anticipate on a context if there is no other request pending.
+ * So if only a single sequential reader was running, AS will not turn on
+ * anticipation. This function turns on anticipation if an io context has
+ * think time within limits and there are no other requests to dispatch.
+ *
+ * With group scheduling, a queue is expired if it is empty, does not have a
+ * request dispatched and we are not idling. In case of this single reader
+ * we will see a queue expiration after every request completion. Hence turn
+ * on the anticipation if an io context should anticipate and there are no
+ * other requests queued in the queue.
+ */
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+
+	if (!ad->nr_dispatched && !asq->nr_queued[1] && !asq->nr_queued[0] &&
+	    as_can_anticipate(ad, NULL)) {
+		as_antic_waitnext(ad);
+	}
+}
+
+/*
  * This is a notification from common layer that it wishes to expire this
  * io queue. AS decides whether queue can be expired, if yes, it also
  * saves the batch context.
@@ -325,13 +366,18 @@ static int as_expire_ioq(struct request_
 		goto keep_queue;
 
 	/*
-	 * If AS anticipation is ON, wait for it to finish.
+	 * If AS anticipation is ON, wait for it to finish if queue slice
+	 * has not expired.
 	 */
 	BUG_ON(status == ANTIC_WAIT_REQ);
 
-	if (status == ANTIC_WAIT_NEXT)
-		goto keep_queue;
-
+	if (status == ANTIC_WAIT_NEXT) {
+		if (!slice_expired)
+			goto keep_queue;
+		/* Slice expired. Stop anticipating. */
+		as_antic_stop(ad);
+		ad->antic_status = ANTIC_OFF;
+	}
 	/* We are good to expire the queue. Save batch context */
 	as_save_batch_context(ad, asq);
 	ad->switch_queue = 0;
@@ -342,6 +388,33 @@ keep_queue:
 	ad->switch_queue = 1;
 	return 0;
 }
+
+static inline void as_check_expire_active_as_queue(struct request_queue *q)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_active_sched_queue(q->elevator);
+
+	/*
+	 * We anticipated on the queue and the timer fired. If the queue is empty,
+	 * expire the queue. This will make sure an idle queue does not
+	 * remain active for a very long time as later all the idle time
+	 * can be added to the queue disk usage.
+	 */
+	if (asq) {
+		if (!ad->nr_dispatched && !asq->nr_queued[1] &&
+		    !asq->nr_queued[0]) {
+			ad->switch_queue = 0;
+			elv_ioq_slice_expired(q, asq->ioq);
+		}
+	}
+}
+
+#else /* CONFIG_IOSCHED_AS_HIER */
+static inline void as_mark_active_asq_wait_request(struct as_data *ad) {}
+static inline void as_clear_active_asq_wait_request(struct as_data *ad) {}
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq) {}
+static inline void as_check_expire_active_as_queue(struct request_queue *q) {}
 #endif
 
 /*
@@ -622,6 +695,7 @@ static void as_antic_waitnext(struct as_
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_mark_active_asq_wait_request(ad);
 	as_log(ad, "antic_waitnext set");
 }
 
@@ -656,6 +730,7 @@ static void as_antic_stop(struct as_data
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
+		as_clear_active_asq_wait_request(ad);
 		ad->antic_status = ANTIC_FINISHED;
 		/* see as_work_handler */
 		kblockd_schedule_work(ad->q, &ad->antic_work);
@@ -672,7 +747,7 @@ static void as_antic_timeout(unsigned lo
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	as_log(ad, "as_antic_timeout");
+	as_log(ad, "as_antic_timeout. antic_status=%d", ad->antic_status);
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -680,6 +755,9 @@ static void as_antic_timeout(unsigned lo
 		aic = ad->io_context->aic;
 
 		ad->antic_status = ANTIC_FINISHED;
+
+		as_clear_active_asq_wait_request(ad);
+		as_check_expire_active_as_queue(q);
 		kblockd_schedule_work(q, &ad->antic_work);
 
 		if (aic->ttime_samples == 0) {
@@ -690,6 +768,7 @@ static void as_antic_timeout(unsigned lo
 			/* process not "saved" by a cooperating request */
 			ad->exit_no_coop = (7*ad->exit_no_coop + 256)/8;
 		}
+
 		spin_unlock(&ad->io_context->lock);
 	}
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1122,7 +1201,8 @@ static void as_completed_request(struct 
 			 * the next one
 			 */
 			as_antic_waitnext(ad);
-		}
+		} else
+			as_hier_check_start_waitnext(q, asq);
 	}
 
 	as_put_io_context(rq);
@@ -1471,7 +1551,6 @@ static void as_add_request(struct reques
 	data_dir = rq_is_sync(rq);
 
 	rq->elevator_private = as_get_io_context(q->node);
-
 	asq->nr_queued[data_dir]++;
 	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
 			data_dir ? 'R' : 'W', asq->nr_queued[1],
Index: linux18/block/elevator-fq.h
===================================================================
--- linux18.orig/block/elevator-fq.h	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/elevator-fq.h	2009-09-14 15:50:04.000000000 -0400
@@ -264,6 +264,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
 	ELV_QUEUE_FLAG_must_expire,       /* expire queue even slice is left */
+	ELV_QUEUE_FLAG_charge_one_slice,  /* Charge the queue for only one
+					   * time slice length */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -287,6 +289,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 ELV_IO_QUEUE_FLAG_FNS(must_expire)
+ELV_IO_QUEUE_FLAG_FNS(charge_one_slice)
 
 #ifdef CONFIG_GROUP_IOSCHED

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-11  1:15         ` Gui Jianfeng
@ 2009-09-15  3:37             ` Vivek Goyal
       [not found]           ` <4AA9A4BE.30005-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-09-15  3:37             ` Vivek Goyal
  2 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-15  3:37 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: jens.axboe, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
> >>>> Hi Vivek,
> >>>>
> >>>> I happened to encount a bug when i test IO Controller V9.
> >>>> When there are three tasks to run concurrently in three group,
> >>>> that is, one is parent group, and other two tasks are running 
> >>>> in two different child groups respectively to read or write 
> >>>> files in some disk, say disk "hdb", The task may hang up, and 
> >>>> other tasks which access into "hdb" will also hang up.
> >>>>
> >>>> The bug only happens when using AS io scheduler.
> >>>> The following scirpt can reproduce this bug in my box.
> >>>>
> >>> Hi Gui,
> >>>
> >>> I tried reproducing this on my system and can't reproduce it. All the
> >>> three processes get killed and system does not hang.
> >>>
> >>> Can you please dig deeper a bit into it. 
> >>>
> >>> - If whole system hangs or it is just IO to disk seems to be hung.
> >>     Only when the task is trying do IO to disk it will hang up.
> >>
> >>> - Does io scheduler switch on the device work
> >>     yes, io scheduler can be switched, and the hung task will be resumed.
> >>
> >>> - If the system is not hung, can you capture the blktrace on the device.
> >>>   Trace might give some idea, what's happening.
> >> I run a "find" task to do some io on that disk, it seems that task hangs 
> >> when it is issuing getdents() syscall.
> >> kernel generates the following message:
> >>
> >> INFO: task find:3260 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> find          D a1e95787  1912  3260   2897 0x00000004
> >>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
> >>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
> >>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
> >> Call Trace:
> >>  [<c0447323>] ? getnstimeofday+0x57/0xe0
> >>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
> >>  [<c068ab68>] io_schedule+0x47/0x79
> >>  [<c04c12ee>] sync_buffer+0x36/0x3a
> >>  [<c068ae14>] __wait_on_bit+0x36/0x5d
> >>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
> >>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
> >>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
> >>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
> >>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
> >>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
> >>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
> >>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
> >>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
> >>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
> >>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
> >>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
> >>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
> >>  [<c04b1100>] ? filldir64+0x0/0xcd
> >>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
> >>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
> >>  [<c04b12db>] ? vfs_readdir+0x46/0x94
> >>  [<c04b12fd>] vfs_readdir+0x68/0x94
> >>  [<c04b1100>] ? filldir64+0x0/0xcd
> >>  [<c04b1387>] sys_getdents64+0x5e/0x9f
> >>  [<c04028b4>] sysenter_do_call+0x12/0x32
> >> 1 lock held by find/3260:
> >>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
> >>
> >> ext3 calls wait_on_buffer() to wait buffer, and schedule the task out in TASK_UNINTERRUPTIBLE
> >> state, and I found this task will be resumed after a quite long period(more than 10 mins).
> > 
> > Thanks Gui. As Jens said, it does look like a case of missing queue
> > restart somewhere and now we are stuck, no requests are being dispatched
> > to the disk and queue is already unplugged.
> > 
> > Can you please also try capturing the trace of events at io scheduler
> > (blktrace) to see how did we get into that situation.
> > 
> > Are you using ide drivers and not libata? As jens said, I will try to make
> > use of ide drivers and see if I can reproduce it.
> > 
> 
> Hi Vivek, Jens,
> 
> Currently, if there's only the root cgroup and no other child cgroup available, io-controller
> optimizes to stop expiring the current ioq, on the assumption that the current ioq belongs to the
> root group. But in some cases this assumption is not true. Consider the following scenario: there is
> a child cgroup located in the root cgroup, task A is running in that child cgroup, and task A issues
> some IOs. Then we kill task A and remove the child cgroup; at this time only the root cgroup is
> available. But the ioq is still under service, and from now on this ioq won't expire because of the
> "only root" optimization. The following patch ensures the ioq does belong to the root group if only
> the root group exists.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>

Hi Gui,

I have modified your patch a bit to improve readability. Looking at the
issue more closely, I realized that this optimization of not expiring the
queue can lead to other issues, like a very high vdisktime in certain
scenarios. While fixing that, I also noticed a high rate of AS queue
expiration in certain cases which could have been avoided.

Here is a patch which should fix all of that. I am still testing it to
make sure that nothing is obviously broken. Will merge it if there are
no issues.

Thanks
Vivek

o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
  and fixed by Gui.

o If an AS queue is not expired for a long time and suddenly somebody
  decides to create a group and launch a job there, the old AS queue will
  be expired with a very high value of slice used and will be charged a
  very high disk time. Fix it by marking the queue as "charge_one_slice"
  and charging the queue for only a single time slice instead of the whole
  duration for which it was running (see the sketch after this changelog).

o There are a few reasons why, in case of AS, excessive queue expiration
  can take place at the elevator fair queuing layer.
	- AS does not anticipate on a queue if there are no competing requests.
	  So if only a single reader is present in a group, anticipation does
	  not get turned on.

	- The elevator layer does not know that AS is anticipating, hence it
	  initiates expiry requests in select_ioq() thinking the queue is empty.

	- The elevator layer tries to aggressively expire the last empty queue.
	  This can lead to a lot of queue expiries.

o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
  the queue has completed and the associated io context is eligible to
  anticipate. AS also lets the elevator layer know that it is anticipating
  (elv_ioq_wait_request()). This solves the above mentioned issues.

o Moved some of the code into separate functions to improve readability.
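
As a minimal, illustrative sketch of the charge_one_slice idea (this is a
user-space toy, not the kernel code; the struct, field and function names
below are made up, and weight scaling, locking etc. are omitted): the queue's
fairness clock (vdisktime) is advanced by at most one allocated slice, while
the real time used still goes into the statistics.

	#include <stdio.h>

	struct toy_ioq {
		unsigned long vdisktime;	/* fairness accounting */
		unsigned long total_time;	/* stats: real time used */
		unsigned long allocated_slice;	/* slice length from prio */
		int charge_one_slice;		/* set while expiry was skipped */
	};

	static void toy_ioq_served(struct toy_ioq *ioq, unsigned long served)
	{
		unsigned long charged = served;

		if (ioq->charge_one_slice) {
			if (charged > ioq->allocated_slice)
				charged = ioq->allocated_slice;
			ioq->charge_one_slice = 0;
		}

		ioq->vdisktime += charged;	/* real code also scales by weight */
		ioq->total_time += served;
	}

	int main(void)
	{
		/* Queue ran 10000ms without being expired; slice is 100ms. */
		struct toy_ioq q = { 0, 0, 100, 1 };

		toy_ioq_served(&q, 10000);
		printf("charged %lums, real %lums\n", q.vdisktime, q.total_time);
		return 0;
	}

So even though the queue was active for 10 seconds, it is charged only 100ms
worth of virtual disk time, which keeps its vdisktime sane once other groups
show up.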

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c  |   93 +++++++++++++++++++++++++++++---
 block/elevator-fq.c |  150 +++++++++++++++++++++++++++++++++++++++++-----------
 block/elevator-fq.h |    3 +
 3 files changed, 210 insertions(+), 36 deletions(-)

Index: linux18/block/elevator-fq.c
===================================================================
--- linux18.orig/block/elevator-fq.c	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/elevator-fq.c	2009-09-14 23:09:08.000000000 -0400
@@ -472,19 +472,18 @@ static inline void debug_entity_vdisktim
 					unsigned long served, u64 delta) {}
 #endif /* DEBUG_ELV_FAIR_QUEUING */
 
-static void
-entity_served(struct io_entity *entity, unsigned long served,
-				unsigned long nr_sectors)
+static void entity_served(struct io_entity *entity, unsigned long real_served,
+		unsigned long virtual_served, unsigned long nr_sectors)
 {
 	for_each_entity(entity) {
 		u64 delta;
 
-		delta = elv_delta_fair(served, entity);
+		delta = elv_delta_fair(virtual_served, entity);
 		entity->vdisktime += delta;
 		update_min_vdisktime(entity->st);
-		entity->total_time += served;
+		entity->total_time += real_served;
 		entity->total_sectors += nr_sectors;
-		debug_entity_vdisktime(entity, served, delta);
+		debug_entity_vdisktime(entity, virtual_served, delta);
 	}
 }
 
@@ -928,7 +927,24 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
-	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+	unsigned long virtual_served = served, allocated_slice;
+
+	/*
+	 * For single ioq schedulers we don't expire the queue if there are
+	 * no other competing groups. It might happen that a queue has not
+	 * been expired for a long time and suddenly a new group is created
+	 * and IO comes in on that new group. In that case, we don't want to
+	 * charge the old queue for the whole period it was not expired.
+	 */
+	if (elv_ioq_charge_one_slice(ioq)) {
+		allocated_slice = elv_prio_to_slice(ioq->efqd, ioq);
+		if (served > allocated_slice)
+			virtual_served = allocated_slice;
+		elv_clear_ioq_charge_one_slice(ioq);
+	}
+
+	entity_served(&ioq->entity, served, virtual_served, ioq->nr_sectors);
 	elv_log_ioq(ioq->efqd, ioq, "ioq served: QSt=%lu QSs=%lu qued=%lu",
 			served, ioq->nr_sectors, ioq->nr_queued);
 	print_ioq_service_stats(ioq);
@@ -2543,6 +2559,22 @@ alloc_sched_q:
 		elv_init_ioq_io_group(ioq, iog);
 		elv_init_ioq_sched_queue(e, ioq, sched_q);
 
+		/*
+		 * For AS, also mark the group queue idle_window. This makes
+		 * sure that select_ioq() will not try to expire an AS queue
+		 * if there are dispatched requests from the queue but the
+		 * queue is empty. This gives the asq a chance to anticipate
+		 * after a request completion, otherwise select_ioq() will
+		 * mark it must_expire and soon the asq will be expired.
+		 *
+		 * Not doing it for noop and deadline yet as they don't have
+		 * any anticipation logic and this would slow down queue
+		 * switching on NCQ supporting hardware.
+		 */
+		if (!strcmp(e->elevator_type->elevator_name, "anticipatory")) {
+			elv_mark_ioq_idle_window(ioq);
+		}
+
 		elv_io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
 		elv_get_iog(iog);
@@ -2664,6 +2696,12 @@ static inline int is_only_root_group(voi
 
 #endif /* CONFIG_GROUP_IOSCHED */
 
+static inline int ioq_is_idling(struct io_queue *ioq)
+{
+	return (elv_ioq_wait_request(ioq) ||
+			timer_pending(&ioq->efqd->idle_slice_timer));
+}
+
 /*
  * Should be called after ioq prio and class has been initialized as prio
  * class data will be used to determine which service tree in the group
@@ -2835,7 +2873,6 @@ elv_iosched_expire_ioq(struct request_qu
 		if (!ret)
 			elv_mark_ioq_must_expire(ioq);
 	}
-
 	return ret;
 }
 
@@ -3078,6 +3115,7 @@ void elv_ioq_request_add(struct request_
 		 */
 		if (group_wait_req || elv_ioq_wait_request(ioq)) {
 			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_request(ioq);
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1 || !blk_queue_plugged(q))
 				__blk_run_queue(q);
@@ -3121,6 +3159,7 @@ static void elv_idle_slice_timer(unsigne
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
 
+		elv_clear_ioq_wait_request(ioq);
 		elv_clear_iog_wait_request(iog);
 
 		if (elv_iog_wait_busy(iog)) {
@@ -3222,6 +3261,28 @@ static inline struct io_queue *elv_close
 	return new_ioq;
 }
 
+/*
+ * One can do some optimizations for single ioq schedulers, where one does
+ * not have to expire the queue after every time slice is used. This avoids
+ * some unnecessary overhead, especially in AS where we wait for requests
+ * from the last queue to finish before a new queue is scheduled in.
+ */
+static inline int single_ioq_no_timed_expiry(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (!elv_iosched_single_ioq(q->elevator))
+		return 0;
+
+	if (!is_only_root_group())
+		return 0;
+
+	if (efqd->busy_queues == 1)
+		return 1;
+
+	return 0;
+}
+
 /* Common layer function to select the next queue to dispatch from */
 void *elv_select_ioq(struct request_queue *q, int force)
 {
@@ -3229,7 +3290,7 @@ void *elv_select_ioq(struct request_queu
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
  	struct elevator_type *e = q->elevator->elevator_type;
- 	int slice_expired = 1;
+ 	int slice_expired = 0;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -3255,16 +3316,20 @@ void *elv_select_ioq(struct request_queu
 	}
 
 	/* This queue has been marked for expiry. Try to expire it */
-	if (elv_ioq_must_expire(ioq))
+	if (elv_ioq_must_expire(ioq)) {
+		elv_log_ioq(efqd, ioq, "select: ioq must_expire. expire");
 		goto expire;
+	}
 
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS).
 	 */
 
-	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator))
+	if (single_ioq_no_timed_expiry(q)) {
+		elv_mark_ioq_charge_one_slice(ioq);
 		goto keep_queue;
+	}
 
 	/* We are waiting for this group to become busy before it expires.*/
 	if (elv_iog_wait_busy(iog)) {
@@ -3301,6 +3366,7 @@ void *elv_select_ioq(struct request_queu
 		 * from queue and is not proportional to group's weight, it
 		 * harms the fairness of the group.
 		 */
+		slice_expired = 1;
 		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
 			ioq = NULL;
 			goto keep_queue;
@@ -3332,7 +3398,7 @@ void *elv_select_ioq(struct request_queu
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
 
-	if (timer_pending(&efqd->idle_slice_timer) ||
+	if (ioq_is_idling(ioq) ||
 	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
 		ioq = NULL;
 		goto keep_queue;
@@ -3344,7 +3410,6 @@ void *elv_select_ioq(struct request_queu
 		goto keep_queue;
 	}
 
-	slice_expired = 0;
 expire:
  	if (efqd->fairness && !force && ioq && ioq->dispatched
  	    && strcmp(e->elevator_name, "anticipatory")) {
@@ -3439,6 +3504,43 @@ void elv_deactivate_rq_fair(struct reque
 						efqd->rq_in_driver);
 }
 
+/*
+ * If this is the only queue and it has completed all its requests and has
+ * nothing to dispatch, expire it. We don't want to keep it around idle,
+ * otherwise later when it is expired all this idle time will be added to
+ * the queue's disk time used and it might not get to run for a long time.
+ */
+static inline void
+check_expire_last_empty_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	if (efqd->busy_queues != 1)
+		return;
+
+	if (ioq->dispatched || ioq->nr_queued)
+		return;
+
+	/*
+	 * Anticipation is on. Don't expire the queue. Either a new request
+	 * will come or it is up to the io scheduler to expire the queue once
+	 * the idle timer fires.
+	 */
+
+	if (ioq_is_idling(ioq))
+		return;
+
+	/*
+	 * If IO scheduler denies expiration here, it is up to io scheduler
+	 * to expire the queue when possible. Otherwise all the idle time
+	 * will be charged to the queue when queue finally expires.
+	 */
+	if (elv_iosched_expire_ioq(q, 0, 0)) {
+		elv_log_ioq(efqd, ioq, "expire last empty queue");
+		elv_slice_expired(q);
+	}
+}
+
 /* A request got completed from io_queue. Do the accounting. */
 void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 {
@@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
 			elv_set_prio_slice(q->elevator->efqd, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
+
 		/*
 		 * If there is only root group present, don't expire the queue
 		 * for single queue ioschedulers (noop, deadline, AS). It is
 		 * unnecessary overhead.
 		 */
 
-		if (is_only_root_group() &&
-			elv_iosched_single_ioq(q->elevator)) {
-			elv_log_ioq(efqd, ioq, "select: only root group,"
-					" no expiry");
+		if (single_ioq_no_timed_expiry(q)) {
+			elv_mark_ioq_charge_one_slice(ioq);
+			elv_log_ioq(efqd, ioq, "single ioq no timed expiry");
 			goto done;
 		}
 
@@ -3519,7 +3621,7 @@ void elv_ioq_completed_request(struct re
 		 * decide to idle on queue, idle on group.
 		 */
 		if (elv_iog_should_idle(ioq) && !elv_ioq_nr_dispatched(ioq)
-		    && !timer_pending(&efqd->idle_slice_timer)) {
+		    && !ioq_is_idling(ioq)) {
 			/*
 			 * If queue has used up its slice, wait for the
 			 * one extra group_idle period to let the group
@@ -3532,17 +3634,7 @@ void elv_ioq_completed_request(struct re
 				elv_iog_arm_slice_timer(q, iog, 0);
 		}
 
-		/*
-		 * if this is only queue and it has completed all its requests
-		 * and has nothing to dispatch, expire it. We don't want to
-		 * keep it around idle otherwise later when it is expired, all
-		 * this idle time will be added to queue's disk time used.
-		 */
-		if (efqd->busy_queues == 1 && !ioq->dispatched &&
-		   !ioq->nr_queued && !timer_pending(&efqd->idle_slice_timer)) {
-			if (elv_iosched_expire_ioq(q, 0, 0))
-				elv_slice_expired(q);
-		}
+		check_expire_last_empty_queue(q, ioq);
 	}
 done:
 	if (!efqd->rq_in_driver)
Index: linux18/block/as-iosched.c
===================================================================
--- linux18.orig/block/as-iosched.c	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/as-iosched.c	2009-09-14 23:13:08.000000000 -0400
@@ -187,6 +187,24 @@ static void as_antic_stop(struct as_data
 static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
 
 #ifdef CONFIG_IOSCHED_AS_HIER
+static int as_can_anticipate(struct as_data *ad, struct request *rq);
+static void as_antic_waitnext(struct as_data *ad);
+
+static inline void as_mark_active_asq_wait_request(struct as_data *ad)
+{
+	struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+	elv_mark_ioq_wait_request(asq->ioq);
+}
+
+static inline void as_clear_active_asq_wait_request(struct as_data *ad)
+{
+	struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+	if (asq)
+		elv_clear_ioq_wait_request(asq->ioq);
+}
+
 static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
 {
 	/* Save batch data dir */
@@ -279,6 +297,29 @@ static void as_active_ioq_set(struct req
 }
 
 /*
+ * AS does not anticipate on a context if there is no other request pending.
+ * So if only a single sequential reader was running, AS will not turn on
+ * anticipation. This function turns on anticipation if an io context has
+ * think time within limits and there are no other requests to dispatch.
+ *
+ * With group scheduling, a queue is expired if it is empty, does not have a
+ * request dispatched and we are not idling. In case of this single reader
+ * we will see a queue expiration after every request completion. Hence turn
+ * on anticipation if the io context should anticipate and there are no
+ * other requests queued in the queue.
+ */
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+
+	if (!ad->nr_dispatched && !asq->nr_queued[1] && !asq->nr_queued[0] &&
+	    as_can_anticipate(ad, NULL)) {
+		as_antic_waitnext(ad);
+	}
+}
+
+/*
  * This is a notification from common layer that it wishes to expire this
  * io queue. AS decides whether queue can be expired, if yes, it also
  * saves the batch context.
@@ -325,13 +366,18 @@ static int as_expire_ioq(struct request_
 		goto keep_queue;
 
 	/*
-	 * If AS anticipation is ON, wait for it to finish.
+	 * If AS anticipation is ON, wait for it to finish if queue slice
+	 * has not expired.
 	 */
 	BUG_ON(status == ANTIC_WAIT_REQ);
 
-	if (status == ANTIC_WAIT_NEXT)
-		goto keep_queue;
-
+	if (status == ANTIC_WAIT_NEXT) {
+		if (!slice_expired)
+			goto keep_queue;
+		/* Slice expired. Stop anticipating. */
+		as_antic_stop(ad);
+		ad->antic_status = ANTIC_OFF;
+	}
 	/* We are good to expire the queue. Save batch context */
 	as_save_batch_context(ad, asq);
 	ad->switch_queue = 0;
@@ -342,6 +388,33 @@ keep_queue:
 	ad->switch_queue = 1;
 	return 0;
 }
+
+static inline void as_check_expire_active_as_queue(struct request_queue *q)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_active_sched_queue(q->elevator);
+
+	/*
+	 * We anticipated on the queue and the timer fired. If the queue is
+	 * empty, expire it. This makes sure an idle queue does not remain
+	 * active for a very long time, as later all the idle time can be
+	 * added to the queue's disk usage.
+	 */
+	if (asq) {
+		if (!ad->nr_dispatched && !asq->nr_queued[1] &&
+		    !asq->nr_queued[0]) {
+			ad->switch_queue = 0;
+			elv_ioq_slice_expired(q, asq->ioq);
+		}
+	}
+}
+
+#else /* CONFIG_IOSCHED_AS_HIER */
+static inline void as_mark_active_asq_wait_request(struct as_data *ad) {}
+static inline void as_clear_active_asq_wait_request(struct as_data *ad) {}
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq) {}
+static inline void as_check_expire_active_as_queue(struct request_queue *q) {}
 #endif
 
 /*
@@ -622,6 +695,7 @@ static void as_antic_waitnext(struct as_
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_mark_active_asq_wait_request(ad);
 	as_log(ad, "antic_waitnext set");
 }
 
@@ -656,6 +730,7 @@ static void as_antic_stop(struct as_data
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
+		as_clear_active_asq_wait_request(ad);
 		ad->antic_status = ANTIC_FINISHED;
 		/* see as_work_handler */
 		kblockd_schedule_work(ad->q, &ad->antic_work);
@@ -672,7 +747,7 @@ static void as_antic_timeout(unsigned lo
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	as_log(ad, "as_antic_timeout");
+	as_log(ad, "as_antic_timeout. antic_status=%d", ad->antic_status);
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -680,6 +755,9 @@ static void as_antic_timeout(unsigned lo
 		aic = ad->io_context->aic;
 
 		ad->antic_status = ANTIC_FINISHED;
+
+		as_clear_active_asq_wait_request(ad);
+		as_check_expire_active_as_queue(q);
 		kblockd_schedule_work(q, &ad->antic_work);
 
 		if (aic->ttime_samples == 0) {
@@ -690,6 +768,7 @@ static void as_antic_timeout(unsigned lo
 			/* process not "saved" by a cooperating request */
 			ad->exit_no_coop = (7*ad->exit_no_coop + 256)/8;
 		}
+
 		spin_unlock(&ad->io_context->lock);
 	}
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1122,7 +1201,8 @@ static void as_completed_request(struct 
 			 * the next one
 			 */
 			as_antic_waitnext(ad);
-		}
+		} else
+			as_hier_check_start_waitnext(q, asq);
 	}
 
 	as_put_io_context(rq);
@@ -1471,7 +1551,6 @@ static void as_add_request(struct reques
 	data_dir = rq_is_sync(rq);
 
 	rq->elevator_private = as_get_io_context(q->node);
-
 	asq->nr_queued[data_dir]++;
 	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
 			data_dir ? 'R' : 'W', asq->nr_queued[1],
Index: linux18/block/elevator-fq.h
===================================================================
--- linux18.orig/block/elevator-fq.h	2009-09-14 15:45:58.000000000 -0400
+++ linux18/block/elevator-fq.h	2009-09-14 15:50:04.000000000 -0400
@@ -264,6 +264,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
 	ELV_QUEUE_FLAG_must_expire,       /* expire queue even slice is left */
+	ELV_QUEUE_FLAG_charge_one_slice,  /* Charge the queue for only one
+					   * time slice length */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -287,6 +289,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(sync)
 ELV_IO_QUEUE_FLAG_FNS(must_expire)
+ELV_IO_QUEUE_FLAG_FNS(charge_one_slice)
 
 #ifdef CONFIG_GROUP_IOSCHED
 

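A note on the new flag helpers used above: ELV_IO_QUEUE_FLAG_FNS() is the
existing helper-generating macro in elevator-fq.h, so the single
ELV_IO_QUEUE_FLAG_FNS(charge_one_slice) line is expected to produce the
elv_mark_ioq_charge_one_slice(), elv_clear_ioq_charge_one_slice() and
elv_ioq_charge_one_slice() helpers used by this patch. A sketch of what that
expansion is assumed to look like (struct name, field name and bit value are
purely illustrative; the real macro body is not part of this hunk):

	struct io_queue_sketch {
		unsigned int flags;		/* stand-in for the real flag word */
	};

	enum { ELV_QUEUE_FLAG_charge_one_slice = 5 };	/* arbitrary bit here */

	static inline void elv_mark_ioq_charge_one_slice(struct io_queue_sketch *ioq)
	{
		ioq->flags |= (1 << ELV_QUEUE_FLAG_charge_one_slice);
	}

	static inline void elv_clear_ioq_charge_one_slice(struct io_queue_sketch *ioq)
	{
		ioq->flags &= ~(1 << ELV_QUEUE_FLAG_charge_one_slice);
	}

	static inline int elv_ioq_charge_one_slice(struct io_queue_sketch *ioq)
	{
		return !!(ioq->flags & (1 << ELV_QUEUE_FLAG_charge_one_slice));
	}
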
^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-15  3:37             ` Vivek Goyal
@ 2009-09-16  0:05               ` Gui Jianfeng
  -1 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-16  0:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: jens.axboe, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I tested IO Controller V9.
>>>>>> When there are three tasks running concurrently in three groups,
>>>>>> that is, one in a parent group and the other two tasks running
>>>>>> in two different child groups respectively to read or write
>>>>>> files on some disk, say disk "hdb", the tasks may hang up, and
>>>>>> other tasks which access "hdb" will also hang up.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug in my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and couldn't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it?
>>>>>
>>>>> - Does the whole system hang, or does just IO to the disk seem to be hung?
>>>>     Only when the task is trying to do IO to the disk does it hang up.
>>>>
>>>>> - Does an io scheduler switch on the device work?
>>>>     Yes, the io scheduler can be switched, and the hung task will then be resumed.
>>>>
>>>>> - If the system is not hung, can you capture the blktrace on the device?
>>>>>   The trace might give some idea of what's happening.
>>>> I ran a "find" task to do some IO on that disk; it seems the task hangs
>>>> when it is issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find          D a1e95787  1912  3260   2897 0x00000004
>>>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>>>  [<c068ab68>] io_schedule+0x47/0x79
>>>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer and schedules the task out in TASK_UNINTERRUPTIBLE
>>>> state, and I found this task is only resumed after quite a long period (more than 10 mins).
>>> Thanks Gui. As Jens said, it does look like a case of a missing queue
>>> restart somewhere and now we are stuck; no requests are being dispatched
>>> to the disk and the queue is already unplugged.
>>>
>>> Can you please also try capturing the trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation.
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
>> optimize to stop expiring the current ioq, on the assumption that the current ioq belongs to the root
>> group. But in some cases this assumption is not true. Consider the following scenario: a child cgroup
>> is located in the root cgroup, task A is running in the child cgroup, and task A issues some IOs. Then
>> we kill task A and remove the child cgroup; at this point only the root cgroup is available. But the
>> ioq is still under service, and from now on this ioq won't expire because of the "only root" optimization.
>> The following patch ensures that the ioq really does belong to the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> 
> Hi Gui,
> 
> I have modified your patch a bit to improve readability. Looking at the
> issue closely I realized that this optimization of not expiring the 
> queue can lead to other issues like high vdisktime in certain scenarios.
> While fixing that, I also noticed the issue of a high rate of AS queue
> expiration in certain cases which could have been avoided.
> 
> Here is a patch which should fix all that. I am still testing this patch
> to make sure that something is not obviously broken. Will merge it if
> there are no issues.
> 
> Thanks
> Vivek
> 
> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
>   and fixed by Gui.
> 
> o If an AS queue is not expired for a long time and suddenly somebody
>   decides to create a group and launch a job there, in that case old AS
>   queue will be expired with a very high value of slice used and will get
>   a very high disk time. Fix it by marking the queue as "charge_one_slice"
>   and charge the queue only for a single time slice and not for whole
>   of the duration when queue was running.
> 
> o There are cases where, with AS, excessive queue expiration will take
>   place at the elevator fair queuing layer for a few reasons.
> 	- AS does not anticipate on a queue if there are no competing requests.
> 	  So if only a single reader is present in a group, anticipation does
> 	  not get turned on.
> 
> 	- The elevator layer does not know that AS is anticipating, hence it
> 	  initiates expiry in select_ioq() thinking the queue is empty.
> 
> 	- The elevator layer tries to aggressively expire the last empty queue.
> 	  This can lead to a lot of queue expiry.
> 
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>   the queue completed and the associated io context is eligible to anticipate.
>   Also, AS lets the elevator layer know that it is anticipating
>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>  
> o Moved some of the code in separate functions to improve readability.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

  I'd like to give this patch a try :)

-- 
Regards
Gui Jianfeng
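
The ANTIC_WAIT_NEXT change described in the changelog above can likewise be
sketched as a self-contained toy model (not the real as-iosched.c; the struct
fields and the eligibility test are simplified assumptions):

#include <stdbool.h>

enum antic_status { ANTIC_OFF, ANTIC_WAIT_NEXT };

struct as_queue { int nr_queued; };			/* requests still queued */
struct as_io_context { bool think_time_ok; bool seek_ok; };

/*
 * Decision taken when the last request of the active queue completes: if the
 * io context looks worth waiting for, start anticipating instead of letting
 * the elevator layer expire the now-empty queue. In the patch discussed here
 * AS also tells the elevator layer about this via elv_ioq_wait_request().
 */
static enum antic_status as_on_last_completion(const struct as_queue *asq,
					       const struct as_io_context *aic)
{
	if (asq->nr_queued > 0)
		return ANTIC_OFF;	/* more work pending, keep dispatching */
	if (!aic->think_time_ok || !aic->seek_ok)
		return ANTIC_OFF;	/* context not worth idling on         */
	return ANTIC_WAIT_NEXT;		/* idle briefly for the next request   */
}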


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
@ 2009-09-16  0:05               ` Gui Jianfeng
  0 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-16  0:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I tested IO Controller V9.
>>>>>> When there are three tasks running concurrently in three groups,
>>>>>> that is, one in a parent group and the other two tasks running
>>>>>> in two different child groups respectively to read or write
>>>>>> files on some disk, say disk "hdb", the tasks may hang up, and
>>>>>> other tasks which access "hdb" will also hang up.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug in my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and couldn't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it?
>>>>>
>>>>> - Does the whole system hang, or does just IO to the disk seem to be hung?
>>>>     Only when the task is trying to do IO to the disk does it hang up.
>>>>
>>>>> - Does an io scheduler switch on the device work?
>>>>     Yes, the io scheduler can be switched, and the hung task will then be resumed.
>>>>
>>>>> - If the system is not hung, can you capture the blktrace on the device?
>>>>>   The trace might give some idea of what's happening.
>>>> I ran a "find" task to do some IO on that disk; it seems the task hangs
>>>> when it is issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find          D a1e95787  1912  3260   2897 0x00000004
>>>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>>>  [<c068ab68>] io_schedule+0x47/0x79
>>>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer and schedules the task out in TASK_UNINTERRUPTIBLE
>>>> state, and I found this task is only resumed after quite a long period (more than 10 mins).
>>> Thanks Gui. As Jens said, it does look like a case of a missing queue
>>> restart somewhere and now we are stuck; no requests are being dispatched
>>> to the disk and the queue is already unplugged.
>>>
>>> Can you please also try capturing the trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation.
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
>> optimize to stop expiring the current ioq, on the assumption that the current ioq belongs to the root
>> group. But in some cases this assumption is not true. Consider the following scenario: a child cgroup
>> is located in the root cgroup, task A is running in the child cgroup, and task A issues some IOs. Then
>> we kill task A and remove the child cgroup; at this point only the root cgroup is available. But the
>> ioq is still under service, and from now on this ioq won't expire because of the "only root" optimization.
>> The following patch ensures that the ioq really does belong to the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> 
> Hi Gui,
> 
> I have modified your patch a bit to improve readability. Looking at the
> issue closely I realized that this optimization of not expiring the 
> queue can lead to other issues like high vdisktime in certain scenarios.
> While fixing that, I also noticed the issue of a high rate of AS queue
> expiration in certain cases which could have been avoided.
> 
> Here is a patch which should fix all that. I am still testing this patch
> to make sure that something is not obviously broken. Will merge it if
> there are no issues.
> 
> Thanks
> Vivek
> 
> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
>   and fixed by Gui.
> 
> o If an AS queue is not expired for a long time and suddenly somebody
>   decides to create a group and launch a job there, in that case old AS
>   queue will be expired with a very high value of slice used and will get
>   a very high disk time. Fix it by marking the queue as "charge_one_slice"
>   and charge the queue only for a single time slice and not for whole
>   of the duration when queue was running.
> 
> o There are cases where, with AS, excessive queue expiration will take
>   place at the elevator fair queuing layer for a few reasons.
> 	- AS does not anticipate on a queue if there are no competing requests.
> 	  So if only a single reader is present in a group, anticipation does
> 	  not get turned on.
> 
> 	- The elevator layer does not know that AS is anticipating, hence it
> 	  initiates expiry in select_ioq() thinking the queue is empty.
> 
> 	- The elevator layer tries to aggressively expire the last empty queue.
> 	  This can lead to a lot of queue expiry.
> 
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>   the queue completed and the associated io context is eligible to anticipate.
>   Also, AS lets the elevator layer know that it is anticipating
>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>  
> o Moved some of the code in separate functions to improve readability.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

  I'd like to give this patch a try :)

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]             ` <20090915033739.GA4054-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-16  0:05               ` Gui Jianfeng
@ 2009-09-16  2:58               ` Gui Jianfeng
  2009-09-24  1:10               ` Gui Jianfeng
  2 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-16  2:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I tested IO Controller V9.
>>>>>> When there are three tasks running concurrently in three groups,
>>>>>> that is, one in a parent group and the other two tasks running
>>>>>> in two different child groups respectively to read or write
>>>>>> files on some disk, say disk "hdb", the tasks may hang up, and
>>>>>> other tasks which access "hdb" will also hang up.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug in my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and couldn't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it?
>>>>>
>>>>> - Does the whole system hang, or does just IO to the disk seem to be hung?
>>>>     Only when the task is trying to do IO to the disk does it hang up.
>>>>
>>>>> - Does an io scheduler switch on the device work?
>>>>     Yes, the io scheduler can be switched, and the hung task will then be resumed.
>>>>
>>>>> - If the system is not hung, can you capture the blktrace on the device?
>>>>>   The trace might give some idea of what's happening.
>>>> I ran a "find" task to do some IO on that disk; it seems the task hangs
>>>> when it is issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find          D a1e95787  1912  3260   2897 0x00000004
>>>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>>>  [<c068ab68>] io_schedule+0x47/0x79
>>>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer and schedules the task out in TASK_UNINTERRUPTIBLE
>>>> state, and I found this task is only resumed after quite a long period (more than 10 mins).
>>> Thanks Gui. As Jens said, it does look like a case of a missing queue
>>> restart somewhere and now we are stuck; no requests are being dispatched
>>> to the disk and the queue is already unplugged.
>>>
>>> Can you please also try capturing the trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation.
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
>> optimize to stop expiring the current ioq, on the assumption that the current ioq belongs to the root
>> group. But in some cases this assumption is not true. Consider the following scenario: a child cgroup
>> is located in the root cgroup, task A is running in the child cgroup, and task A issues some IOs. Then
>> we kill task A and remove the child cgroup; at this point only the root cgroup is available. But the
>> ioq is still under service, and from now on this ioq won't expire because of the "only root" optimization.
>> The following patch ensures that the ioq really does belong to the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> 
> Hi Gui,
> 
> I have modified your patch a bit to improve readability. Looking at the
> issue closely I realized that this optimization of not expiring the 
> queue can lead to other issues like high vdisktime in certain scenarios.
> While fixing that, I also noticed the issue of a high rate of AS queue
> expiration in certain cases which could have been avoided.
> 
> Here is a patch which should fix all that. I am still testing this patch
> to make sure that something is not obviously broken. Will merge it if
> there are no issues.
> 
> Thanks
> Vivek
> 
> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
>   and fixed by Gui.
> 
> o If an AS queue is not expired for a long time and suddenly somebody
>   decides to create a group and launch a job there, in that case old AS
>   queue will be expired with a very high value of slice used and will get
>   a very high disk time. Fix it by marking the queue as "charge_one_slice"
>   and charge the queue only for a single time slice and not for whole
>   of the duration when queue was running.
> 
> o There are cases where, with AS, excessive queue expiration will take
>   place at the elevator fair queuing layer for a few reasons.
> 	- AS does not anticipate on a queue if there are no competing requests.
> 	  So if only a single reader is present in a group, anticipation does
> 	  not get turned on.
> 
> 	- The elevator layer does not know that AS is anticipating, hence it
> 	  initiates expiry in select_ioq() thinking the queue is empty.
> 
> 	- The elevator layer tries to aggressively expire the last empty queue.
> 	  This can lead to a lot of queue expiry.
> 
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>   the queue completed and the associated io context is eligible to anticipate.
>   Also, AS lets the elevator layer know that it is anticipating
>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>  
> o Moved some of the code in separate functions to improve readability.
> 
...

>  /* A request got completed from io_queue. Do the accounting. */
>  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>  {
> @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
>  			elv_set_prio_slice(q->elevator->efqd, ioq);
>  			elv_clear_ioq_slice_new(ioq);
>  		}
> +
>  		/*
>  		 * If there is only root group present, don't expire the queue
>  		 * for single queue ioschedulers (noop, deadline, AS). It is
>  		 * unnecessary overhead.
>  		 */
>  
> -		if (is_only_root_group() &&
> -			elv_iosched_single_ioq(q->elevator)) {
> -			elv_log_ioq(efqd, ioq, "select: only root group,"
> -					" no expiry");
> +		if (single_ioq_no_timed_expiry(q)) {

  Hi Vivek,

  So we make use of single_ioq_no_timed_expiry() to decide whether the only busy
  ioq is the root ioq, right? But single_ioq_no_timed_expiry() only checks whether
  the root cgroup is the only group and whether there is only one busy_ioq there. As
  I explained in a previous mail, these two checks are not sufficient to say the
  current active ioq comes from the root group, because when the child cgroup has just
  been removed, the ioq which belongs to the child group may still be there (maybe some
  requests are in flight). In this case, the "only root cgroup" and "only one active ioq"
  (the child ioq) checks are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we
  still need to check that "efqd->root_group->ioq" has already been created to ensure the
  only ioq comes from the root group. Am I missing something?


> +			elv_mark_ioq_charge_one_slice(ioq);
> +			elv_log_ioq(efqd, ioq, "single ioq no timed expiry");
>  			goto done;
>  		}
>  

-- 
Regards
Gui Jianfeng
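
The extra condition Gui argues for can be pictured with a self-contained
sketch (not elevator-fq.c itself; every type and field below is a simplified
stand-in for the real helpers such as is_only_root_group() and
elv_iosched_single_ioq()):

struct io_queue { int busy; };
struct io_group { struct io_queue *ioq; };
struct elv_fq_data {
	struct io_group *root_group;
	struct io_queue *active_queue;
	int busy_queues;
	int only_root_group;	/* stands in for is_only_root_group()     */
	int single_ioq_sched;	/* stands in for elv_iosched_single_ioq() */
};

/* Skip timed expiry only when the active ioq is certainly the root one. */
static int single_ioq_no_timed_expiry(struct elv_fq_data *efqd)
{
	if (!efqd->only_root_group || !efqd->single_ioq_sched)
		return 0;
	if (efqd->busy_queues != 1)
		return 0;
	/*
	 * Gui's extra condition: a just-removed child group may still own
	 * the only busy ioq, so insist that the root group's ioq exists and
	 * is the one currently active.
	 */
	if (!efqd->root_group->ioq ||
	    efqd->active_queue != efqd->root_group->ioq)
		return 0;
	return 1;
}

With this extra test, the "skip timed expiry" shortcut is only taken when the
surviving busy ioq really is the root group's, which is the case the
optimization was meant for.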

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-15  3:37             ` Vivek Goyal
                               ` (2 preceding siblings ...)
  (?)
@ 2009-09-16  2:58             ` Gui Jianfeng
  2009-09-16 18:09                 ` Vivek Goyal
       [not found]               ` <4AB05442.6080004-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 2 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-16  2:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: jens.axboe, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I tested IO Controller V9.
>>>>>> When there are three tasks running concurrently in three groups,
>>>>>> that is, one in a parent group and the other two tasks running
>>>>>> in two different child groups respectively to read or write
>>>>>> files on some disk, say disk "hdb", the tasks may hang up, and
>>>>>> other tasks which access "hdb" will also hang up.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug in my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and couldn't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it?
>>>>>
>>>>> - Does the whole system hang, or does just IO to the disk seem to be hung?
>>>>     Only when the task is trying to do IO to the disk does it hang up.
>>>>
>>>>> - Does an io scheduler switch on the device work?
>>>>     Yes, the io scheduler can be switched, and the hung task will then be resumed.
>>>>
>>>>> - If the system is not hung, can you capture the blktrace on the device?
>>>>>   The trace might give some idea of what's happening.
>>>> I ran a "find" task to do some IO on that disk; it seems the task hangs
>>>> when it is issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find          D a1e95787  1912  3260   2897 0x00000004
>>>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>>  [<c0447323>] ? getnstimeofday+0x57/0xe0
>>>>  [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>>>  [<c068ab68>] io_schedule+0x47/0x79
>>>>  [<c04c12ee>] sync_buffer+0x36/0x3a
>>>>  [<c068ae14>] __wait_on_bit+0x36/0x5d
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>>>  [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>>  [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>>>  [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>>>  [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>>>  [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>>  [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>>  [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>>>  [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>>>  [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>>>  [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>>>  [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>>  [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>>  [<c04b12db>] ? vfs_readdir+0x46/0x94
>>>>  [<c04b12fd>] vfs_readdir+0x68/0x94
>>>>  [<c04b1100>] ? filldir64+0x0/0xcd
>>>>  [<c04b1387>] sys_getdents64+0x5e/0x9f
>>>>  [<c04028b4>] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer and schedules the task out in TASK_UNINTERRUPTIBLE
>>>> state, and I found this task is only resumed after quite a long period (more than 10 mins).
>>> Thanks Gui. As Jens said, it does look like a case of a missing queue
>>> restart somewhere and now we are stuck; no requests are being dispatched
>>> to the disk and the queue is already unplugged.
>>>
>>> Can you please also try capturing the trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation.
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if there's only the root cgroup and no other child cgroup available, io-controller will
>> optimize to stop expiring the current ioq, on the assumption that the current ioq belongs to the root
>> group. But in some cases this assumption is not true. Consider the following scenario: a child cgroup
>> is located in the root cgroup, task A is running in the child cgroup, and task A issues some IOs. Then
>> we kill task A and remove the child cgroup; at this point only the root cgroup is available. But the
>> ioq is still under service, and from now on this ioq won't expire because of the "only root" optimization.
>> The following patch ensures that the ioq really does belong to the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> 
> Hi Gui,
> 
> I have modified your patch a bit to improve readability. Looking at the
> issue closely I realized that this optimization of not expiring the 
> queue can lead to other issues like high vdisktime in certain scenarios.
> While fixing that, I also noticed the issue of a high rate of AS queue
> expiration in certain cases which could have been avoided.
> 
> Here is a patch which should fix all that. I am still testing this patch
> to make sure that something is not obviously broken. Will merge it if
> there are no issues.
> 
> Thanks
> Vivek
> 
> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
>   and fixed by Gui.
> 
> o If an AS queue is not expired for a long time and suddenly somebody
>   decides to create a group and launch a job there, in that case old AS
>   queue will be expired with a very high value of slice used and will get
>   a very high disk time. Fix it by marking the queue as "charge_one_slice"
>   and charge the queue only for a single time slice and not for whole
>   of the duration when queue was running.
> 
> o There are cases where, with AS, excessive queue expiration will take
>   place at the elevator fair queuing layer for a few reasons.
> 	- AS does not anticipate on a queue if there are no competing requests.
> 	  So if only a single reader is present in a group, anticipation does
> 	  not get turned on.
> 
> 	- The elevator layer does not know that AS is anticipating, hence it
> 	  initiates expiry in select_ioq() thinking the queue is empty.
> 
> 	- The elevator layer tries to aggressively expire the last empty queue.
> 	  This can lead to a lot of queue expiry.
> 
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>   the queue completed and the associated io context is eligible to anticipate.
>   Also, AS lets the elevator layer know that it is anticipating
>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>  
> o Moved some of the code in separate functions to improve readability.
> 
...

>  /* A request got completed from io_queue. Do the accounting. */
>  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>  {
> @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
>  			elv_set_prio_slice(q->elevator->efqd, ioq);
>  			elv_clear_ioq_slice_new(ioq);
>  		}
> +
>  		/*
>  		 * If there is only root group present, don't expire the queue
>  		 * for single queue ioschedulers (noop, deadline, AS). It is
>  		 * unnecessary overhead.
>  		 */
>  
> -		if (is_only_root_group() &&
> -			elv_iosched_single_ioq(q->elevator)) {
> -			elv_log_ioq(efqd, ioq, "select: only root group,"
> -					" no expiry");
> +		if (single_ioq_no_timed_expiry(q)) {

  Hi Vivek,

  So we make use of single_ioq_no_timed_expiry() to decide whether the only busy
  ioq is the root ioq, right? But single_ioq_no_timed_expiry() only checks whether
  the root cgroup is the only group and whether there is only one busy_ioq there. As
  I explained in a previous mail, these two checks are not sufficient to say the
  current active ioq comes from the root group, because when the child cgroup has just
  been removed, the ioq which belongs to the child group may still be there (maybe some
  requests are in flight). In this case, the "only root cgroup" and "only one active ioq"
  (the child ioq) checks are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we
  still need to check that "efqd->root_group->ioq" has already been created to ensure the
  only ioq comes from the root group. Am I missing something?


> +			elv_mark_ioq_charge_one_slice(ioq);
> +			elv_log_ioq(efqd, ioq, "single ioq no timed expiry");
>  			goto done;
>  		}
>  

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]               ` <4AB05442.6080004-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-09-16 18:09                 ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 16, 2009 at 10:58:10AM +0800, Gui Jianfeng wrote:

[..]
> > o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
> >   and fixed by Gui.
> > 
> > o If an AS queue is not expired for a long time and suddenly somebody
> >   decides to create a group and launch a job there, in that case old AS
> >   queue will be expired with a very high value of slice used and will get
> >   a very high disk time. Fix it by marking the queue as "charge_one_slice"
> >   and charge the queue only for a single time slice and not for whole
> >   of the duration when queue was running.
> > 
> > o There are cases where, with AS, excessive queue expiration will take
> >   place at the elevator fair queuing layer for a few reasons.
> > 	- AS does not anticipate on a queue if there are no competing requests.
> > 	  So if only a single reader is present in a group, anticipation does
> > 	  not get turned on.
> > 
> > 	- The elevator layer does not know that AS is anticipating, hence it
> > 	  initiates expiry in select_ioq() thinking the queue is empty.
> > 
> > 	- The elevator layer tries to aggressively expire the last empty queue.
> > 	  This can lead to a lot of queue expiry.
> > 
> > o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
> >   the queue completed and the associated io context is eligible to anticipate.
> >   Also, AS lets the elevator layer know that it is anticipating
> >   (elv_ioq_wait_request()). This solves the above mentioned issues.
> >  
> > o Moved some of the code in separate functions to improve readability.
> > 
> ...
> 
> >  /* A request got completed from io_queue. Do the accounting. */
> >  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> >  {
> > @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
> >  			elv_set_prio_slice(q->elevator->efqd, ioq);
> >  			elv_clear_ioq_slice_new(ioq);
> >  		}
> > +
> >  		/*
> >  		 * If there is only root group present, don't expire the queue
> >  		 * for single queue ioschedulers (noop, deadline, AS). It is
> >  		 * unnecessary overhead.
> >  		 */
> >  
> > -		if (is_only_root_group() &&
> > -			elv_iosched_single_ioq(q->elevator)) {
> > -			elv_log_ioq(efqd, ioq, "select: only root group,"
> > -					" no expiry");
> > +		if (single_ioq_no_timed_expiry(q)) {
> 
>   Hi Vivek,
> 
>   So we make use of single_ioq_no_timed_expiry() to decide whether the only busy
>   ioq is the root ioq, right? But single_ioq_no_timed_expiry() only checks whether
>   the root cgroup is the only group and whether there is only one busy_ioq there. As
>   I explained in a previous mail, these two checks are not sufficient to say the
>   current active ioq comes from the root group, because when the child cgroup has just
>   been removed, the ioq which belongs to the child group may still be there (maybe some
>   requests are in flight). In this case, the "only root cgroup" and "only one active ioq"
>   (the child ioq) checks are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we
>   still need to check that "efqd->root_group->ioq" has already been created to ensure the
>   only ioq comes from the root group. Am I missing something?
> 

Hi Gui,

The only side effect of not checking for "efqd->root_group->ioq" seems to
be that this ioq, and hence the io group of the child, will not be freed
immediately; the release will be delayed until some other queue in the
system gets backlogged or the ioscheduler exits. At that point, this child
queue will be expired and the ioq and iog will be freed.

The advantage is that if IO is happening only in the child group, we can
avoid expiring that queue.

So maybe keeping the queue around for some time is not a bad idea.

Am I missing something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
  2009-09-16  2:58             ` Gui Jianfeng
@ 2009-09-16 18:09                 ` Vivek Goyal
       [not found]               ` <4AB05442.6080004-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: jens.axboe, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm,
	peterz, jmarchan, torvalds, mingo, riel

On Wed, Sep 16, 2009 at 10:58:10AM +0800, Gui Jianfeng wrote:

[..]
> > o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
> >   and fixed by Gui.
> > 
> > o If an AS queue is not expired for a long time and suddenly somebody
> >   decides to create a group and launch a job there, in that case old AS
> >   queue will be expired with a very high value of slice used and will get
> >   a very high disk time. Fix it by marking the queue as "charge_one_slice"
> >   and charge the queue only for a single time slice and not for whole
> >   of the duration when queue was running.
> > 
> > o There are cases where, with AS, excessive queue expiration will take
> >   place at the elevator fair queuing layer for a few reasons.
> > 	- AS does not anticipate on a queue if there are no competing requests.
> > 	  So if only a single reader is present in a group, anticipation does
> > 	  not get turned on.
> > 
> > 	- The elevator layer does not know that AS is anticipating, hence it
> > 	  initiates expiry in select_ioq() thinking the queue is empty.
> > 
> > 	- The elevator layer tries to aggressively expire the last empty queue.
> > 	  This can lead to a lot of queue expiry.
> > 
> > o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
> >   the queue completed and the associated io context is eligible to anticipate.
> >   Also, AS lets the elevator layer know that it is anticipating
> >   (elv_ioq_wait_request()). This solves the above mentioned issues.
> >  
> > o Moved some of the code in separate functions to improve readability.
> > 
> ...
> 
> >  /* A request got completed from io_queue. Do the accounting. */
> >  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> >  {
> > @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
> >  			elv_set_prio_slice(q->elevator->efqd, ioq);
> >  			elv_clear_ioq_slice_new(ioq);
> >  		}
> > +
> >  		/*
> >  		 * If there is only root group present, don't expire the queue
> >  		 * for single queue ioschedulers (noop, deadline, AS). It is
> >  		 * unnecessary overhead.
> >  		 */
> >  
> > -		if (is_only_root_group() &&
> > -			elv_iosched_single_ioq(q->elevator)) {
> > -			elv_log_ioq(efqd, ioq, "select: only root group,"
> > -					" no expiry");
> > +		if (single_ioq_no_timed_expiry(q)) {
> 
>   Hi Vivek,
> 
>   So we make use of single_ioq_no_timed_expiry() to decide whether the only busy
>   ioq is the root ioq, right? But single_ioq_no_timed_expiry() only checks whether
>   the root cgroup is the only group and whether there is only one busy_ioq there. As
>   I explained in a previous mail, these two checks are not sufficient to say the
>   current active ioq comes from the root group, because when the child cgroup has just
>   been removed, the ioq which belongs to the child group may still be there (maybe some
>   requests are in flight). In this case, the "only root cgroup" and "only one active ioq"
>   (the child ioq) checks are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we
>   still need to check that "efqd->root_group->ioq" has already been created to ensure the
>   only ioq comes from the root group. Am I missing something?
> 

Hi Gui,

The only side effect of not checking for "efqd->root_group->ioq" seems to
be that this ioq, and hence the io group of the child, will not be freed
immediately; the release will be delayed until some other queue in the
system gets backlogged or the ioscheduler exits. At that point, this child
queue will be expired and the ioq and iog will be freed.

The advantage is that if IO is happening only in the child group, we can
avoid expiring that queue.

So maybe keeping the queue around for some time is not a bad idea.

Am I missing something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
@ 2009-09-16 18:09                 ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, mikew, jmoyer, nauman, mingo,
	m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Wed, Sep 16, 2009 at 10:58:10AM +0800, Gui Jianfeng wrote:

[..]
> > o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
> >   and fixed by Gui.
> > 
> > o If an AS queue is not expired for a long time and suddenly somebody
> >   decides to create a group and launch a job there, in that case old AS
> >   queue will be expired with a very high value of slice used and will get
> >   a very high disk time. Fix it by marking the queue as "charge_one_slice"
> >   and charge the queue only for a single time slice and not for whole
> >   of the duration when queue was running.
> > 
> > o There are cases where, with AS, excessive queue expiration will take
> >   place at the elevator fair queuing layer for a few reasons.
> > 	- AS does not anticipate on a queue if there are no competing requests.
> > 	  So if only a single reader is present in a group, anticipation does
> > 	  not get turned on.
> > 
> > 	- The elevator layer does not know that AS is anticipating, hence it
> > 	  initiates expiry in select_ioq() thinking the queue is empty.
> > 
> > 	- The elevator layer tries to aggressively expire the last empty queue.
> > 	  This can lead to a lot of queue expiry.
> > 
> > o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
> >   the queue completed and the associated io context is eligible to anticipate.
> >   Also, AS lets the elevator layer know that it is anticipating
> >   (elv_ioq_wait_request()). This solves the above mentioned issues.
> >  
> > o Moved some of the code in separate functions to improve readability.
> > 
> ...
> 
> >  /* A request got completed from io_queue. Do the accounting. */
> >  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> >  {
> > @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
> >  			elv_set_prio_slice(q->elevator->efqd, ioq);
> >  			elv_clear_ioq_slice_new(ioq);
> >  		}
> > +
> >  		/*
> >  		 * If there is only root group present, don't expire the queue
> >  		 * for single queue ioschedulers (noop, deadline, AS). It is
> >  		 * unnecessary overhead.
> >  		 */
> >  
> > -		if (is_only_root_group() &&
> > -			elv_iosched_single_ioq(q->elevator)) {
> > -			elv_log_ioq(efqd, ioq, "select: only root group,"
> > -					" no expiry");
> > +		if (single_ioq_no_timed_expiry(q)) {
> 
>   Hi Vivek,
> 
>   So we make use of single_ioq_no_timed_expiry() to decide whether the only busy
>   ioq is the root ioq, right? But single_ioq_no_timed_expiry() only checks whether
>   the root cgroup is the only group and whether there is only one busy_ioq there. As
>   I explained in a previous mail, these two checks are not sufficient to say the
>   current active ioq comes from the root group, because when the child cgroup has just
>   been removed, the ioq which belongs to the child group may still be there (maybe some
>   requests are in flight). In this case, the "only root cgroup" and "only one active ioq"
>   (the child ioq) checks are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we
>   still need to check that "efqd->root_group->ioq" has already been created to ensure the
>   only ioq comes from the root group. Am I missing something?
> 

Hi Gui,

The only side effect of not checking for "efqd->root_group->ioq" seems to
be that this ioq, and hence the io group of the child, will not be freed
immediately; the release will be delayed until some other queue in the
system gets backlogged or the ioscheduler exits. At that point, this child
queue will be expired and the ioq and iog will be freed.

The advantage is that if IO is happening only in the child group, we can
avoid expiring that queue.

So maybe keeping the queue around for some time is not a bad idea.

Am I missing something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
       [not found]     ` <e98e18940909141133m5186b780r3215ce15141e4f87-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-16 18:47       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:47 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 14, 2009 at 11:33:37AM -0700, Nauman Rafique wrote:

[..]
> >  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
> >  {
> >        struct request *rq;
> > +       struct request_list *rl;
> >
> >        BUG_ON(rw != READ && rw != WRITE);
> >
> >        spin_lock_irq(q->queue_lock);
> > +       rl = blk_get_request_list(q, NULL);
> >        if (gfp_mask & __GFP_WAIT) {
> >                rq = get_request_wait(q, rw, NULL);
> >        } else {
> > -               rq = get_request(q, rw, NULL, gfp_mask);
> > +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
> >                if (!rq)
> >                        spin_unlock_irq(q->queue_lock);
> >        }
> > @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> >        if (req->cmd_flags & REQ_ALLOCED) {
> >                int is_sync = rq_is_sync(req) != 0;
> >                int priv = req->cmd_flags & REQ_ELVPRIV;
> > +               struct request_list *rl = rq_rl(q, req);
> >
> >                BUG_ON(!list_empty(&req->queuelist));
> >                BUG_ON(!hlist_unhashed(&req->hash));
> >
> >                blk_free_request(q, req);
> > -               freed_request(q, is_sync, priv);
> > +               freed_request(q, is_sync, priv, rl);
> 
> We have a potential memory bug here. freed_request should be called
> before blk_free_request, as blk_free_request might result in release of the
> cgroup and the request_list. Calling freed_request after blk_free_request
> would result in operations on freed memory.
> 

Good point Nauman. Freeing the rq will drop a reference on the io queue,
which in turn will drop a reference on the io group, and if the associated
cgroup is already gone, the io group will be freed; hence the request list
pointer is no longer valid and any operation on it is bad.

So either we can take a reference on the queue, call free_request() and
then drop the reference, or we can call freed_request() before
blk_free_request(). Calling freed_request() before blk_free_request()
sounds better to me. The only thing is that this function name and other
dependent function names should now be freeing_request() instead of
freed_request(). :-)

Will move freed_request() before blk_free_request() in next version.

Thanks
Vivek
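
For reference, the reordering agreed on above would make the quoted hunk look
roughly like this (a sketch of the intent only, not the actual follow-up
patch):

	if (req->cmd_flags & REQ_ALLOCED) {
		int is_sync = rq_is_sync(req) != 0;
		int priv = req->cmd_flags & REQ_ELVPRIV;
		struct request_list *rl = rq_rl(q, req);

		BUG_ON(!list_empty(&req->queuelist));
		BUG_ON(!hlist_unhashed(&req->hash));

		/* Account against rl while the group it belongs to is
		 * guaranteed to still be alive ... */
		freed_request(q, is_sync, priv, rl);
		/* ... then drop the request, which may release the last
		 * ioq/iog reference and free rl's group. */
		blk_free_request(q, req);
	}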

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
  2009-09-14 18:33     ` Nauman Rafique
@ 2009-09-16 18:47       ` Vivek Goyal
  -1 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:47 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf,
	mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Mon, Sep 14, 2009 at 11:33:37AM -0700, Nauman Rafique wrote:

[..]
> >  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
> >  {
> >        struct request *rq;
> > +       struct request_list *rl;
> >
> >        BUG_ON(rw != READ && rw != WRITE);
> >
> >        spin_lock_irq(q->queue_lock);
> > +       rl = blk_get_request_list(q, NULL);
> >        if (gfp_mask & __GFP_WAIT) {
> >                rq = get_request_wait(q, rw, NULL);
> >        } else {
> > -               rq = get_request(q, rw, NULL, gfp_mask);
> > +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
> >                if (!rq)
> >                        spin_unlock_irq(q->queue_lock);
> >        }
> > @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> >        if (req->cmd_flags & REQ_ALLOCED) {
> >                int is_sync = rq_is_sync(req) != 0;
> >                int priv = req->cmd_flags & REQ_ELVPRIV;
> > +               struct request_list *rl = rq_rl(q, req);
> >
> >                BUG_ON(!list_empty(&req->queuelist));
> >                BUG_ON(!hlist_unhashed(&req->hash));
> >
> >                blk_free_request(q, req);
> > -               freed_request(q, is_sync, priv);
> > +               freed_request(q, is_sync, priv, rl);
> 
> We have a potential memory bug here. freed_request should be called
> before blk_free_request, as blk_free_request might result in release of the
> cgroup and the request_list. Calling freed_request after blk_free_request
> would result in operations on freed memory.
> 

Good point Nauman. Freeing the rq will drop a reference on the io queue,
which in turn will drop a reference on the io group, and if the associated
cgroup is already gone, the io group will be freed; hence the request list
pointer is no longer valid and any operation on it is bad.

So either we can take a reference on the queue, call free_request() and
then drop the reference, or we can call freed_request() before
blk_free_request(). Calling freed_request() before blk_free_request()
sounds better to me. The only thing is that this function name and other
dependent function names should now be freeing_request() instead of
freed_request(). :-)

Will move freed_request() before blk_free_request() in next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH 20/23] io-controller: Per cgroup request descriptor support
@ 2009-09-16 18:47       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-16 18:47 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers,
	linux-kernel, akpm, righi.andrea, torvalds

On Mon, Sep 14, 2009 at 11:33:37AM -0700, Nauman Rafique wrote:

[..]
> >  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
> >  {
> >        struct request *rq;
> > +       struct request_list *rl;
> >
> >        BUG_ON(rw != READ && rw != WRITE);
> >
> >        spin_lock_irq(q->queue_lock);
> > +       rl = blk_get_request_list(q, NULL);
> >        if (gfp_mask & __GFP_WAIT) {
> >                rq = get_request_wait(q, rw, NULL);
> >        } else {
> > -               rq = get_request(q, rw, NULL, gfp_mask);
> > +               rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
> >                if (!rq)
> >                        spin_unlock_irq(q->queue_lock);
> >        }
> > @@ -1085,12 +1269,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> >        if (req->cmd_flags & REQ_ALLOCED) {
> >                int is_sync = rq_is_sync(req) != 0;
> >                int priv = req->cmd_flags & REQ_ELVPRIV;
> > +               struct request_list *rl = rq_rl(q, req);
> >
> >                BUG_ON(!list_empty(&req->queuelist));
> >                BUG_ON(!hlist_unhashed(&req->hash));
> >
> >                blk_free_request(q, req);
> > -               freed_request(q, is_sync, priv);
> > +               freed_request(q, is_sync, priv, rl);
> 
> We have a potential memory bug here. freed_request should be called
> before blk_free_request, as blk_free_request might result in release of the
> cgroup and the request_list. Calling freed_request after blk_free_request
> would result in operations on freed memory.
> 

Good point Nauman. Freeing the rq will drop a reference on the io queue,
which in turn will drop a reference on the io group, and if the associated
cgroup is already gone, the io group will be freed; hence the request list
pointer is no longer valid and any operation on it is bad.

So either we can take a reference on the queue, call free_request() and
then drop the reference, or we can call freed_request() before
blk_free_request(). Calling freed_request() before blk_free_request()
sounds better to me. The only thing is that this function name and other
dependent function names should now be freeing_request() instead of
freed_request(). :-)

Will move freed_request() before blk_free_request() in next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]                 ` <20090916180915.GE5221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-17  6:08                   ` Gui Jianfeng
  0 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-17  6:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Wed, Sep 16, 2009 at 10:58:10AM +0800, Gui Jianfeng wrote:
> 
> [..]
>>> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
>>>   and fixed by Gui.
>>>
>>> o If an AS queue is not expired for a long time and suddenly somebody
>>>   decides to create a group and launch a job there, the old AS queue will
>>>   be expired with a very high value of slice used and will get charged a
>>>   very high disk time. Fix it by marking the queue as "charge_one_slice"
>>>   and charging the queue only for a single time slice and not for the
>>>   whole duration the queue was running.
>>>
>>> o There are cases where, with AS, excessive queue expiration will be done
>>>   by the elevator fair queuing layer for a few reasons.
>>> 	- AS does not anticipate on a queue if there are no competing requests.
>>> 	  So if only a single reader is present in a group, anticipation does
>>> 	  not get turned on.
>>>
>>> 	- The elevator layer does not know that AS is anticipating, hence it
>>> 	  initiates queue expiry in select_ioq() thinking the queue is empty.
>>>
>>> 	- The elevator layer tries to aggressively expire the last empty queue.
>>> 	  This can lead to a lot of queue expiry.
>>>
>>> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>>>   the queue completed and the associated io context is eligible to
>>>   anticipate. Also, AS lets the elevator layer know that it is anticipating
>>>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>>>  
>>> o Moved some of the code in separate functions to improve readability.
>>>
>> ...
>>
>>>  /* A request got completed from io_queue. Do the accounting. */
>>>  void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>>>  {
>>> @@ -3470,16 +3572,16 @@ void elv_ioq_completed_request(struct re
>>>  			elv_set_prio_slice(q->elevator->efqd, ioq);
>>>  			elv_clear_ioq_slice_new(ioq);
>>>  		}
>>> +
>>>  		/*
>>>  		 * If there is only root group present, don't expire the queue
>>>  		 * for single queue ioschedulers (noop, deadline, AS). It is
>>>  		 * unnecessary overhead.
>>>  		 */
>>>  
>>> -		if (is_only_root_group() &&
>>> -			elv_iosched_single_ioq(q->elevator)) {
>>> -			elv_log_ioq(efqd, ioq, "select: only root group,"
>>> -					" no expiry");
>>> +		if (single_ioq_no_timed_expiry(q)) {
>>   Hi Vivek,
>>
>>   So we make use of single_ioq_no_timed_expiry() to decide whether only the
>>   root ioq is busy, right? But single_ioq_no_timed_expiry() only checks that
>>   the root cgroup is the only group and that there is only one busy_ioq. As
>>   I explained in the previous mail, these two checks are not sufficient to
>>   say that the current active ioq comes from the root group, because when a
>>   child cgroup has just been removed, the ioq which belongs to the child
>>   group may still be there (maybe some requests are in flight). In that case
>>   the "only root cgroup" and "only one active ioq" (the child ioq) checks
>>   are both satisfied. So IMHO, in single_ioq_no_timed_expiry() we still need
>>   to check that "efqd->root_group->ioq" has already been created to ensure
>>   that the only ioq comes from the root group. Am I missing something?
>>
> 
> Hi Gui,
> 
> The only side effect of not checking for "efqd->root_group->ioq" seems to
> be that this ioq, and hence the io group of the child, will not be freed
> immediately; the release will be delayed until some other queue in the
> system gets backlogged or the io scheduler exits. At that point this child
> queue will be expired and the ioq and iog will be freed.
> 
> The advantage is that if IO is happening only in the child group, we can
> avoid expiring that queue.
> 
> So maybe keeping the queue around for some time is not a bad idea. 
> 
> Am I missing something?

  Hi Vivek,

  I think you're right.
  A thought comes to my mind: can we extend the optimization of not expiring
  an ioq beyond the root group to the general single-ioq case? IOW, if there
  is only one ioq in the hierarchy, we don't expire it until another ioq gets
  backlogged, and we charge only one time slice for that ioq when it is
  scheduled out. This might help if IO only happens in a child group.
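
  For reference, a rough sketch of the check being debated here, assuming the
  helper names used in this thread (the actual body of
  single_ioq_no_timed_expiry() is not shown in this mail, so this is
  illustrative only, not patch code):

static inline int single_ioq_no_timed_expiry(struct request_queue *q)
{
	struct elv_fq_data *efqd = q->elevator->efqd;

	/* only the root cgroup exists and the ioscheduler keeps one ioq */
	if (!is_only_root_group() || !elv_iosched_single_ioq(q->elevator))
		return 0;

	/* exactly one busy ioq in the whole hierarchy */
	if (elv_nr_busy_ioq(q->elevator) != 1)
		return 0;

	/*
	 * Extra check proposed above: make sure the root group's own ioq has
	 * been created, so the lone busy ioq cannot be a leftover queue of a
	 * just-removed child cgroup.
	 */
	return efqd->root_group->ioq != NULL;
}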

> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 322+ messages in thread

* [PATCH] io-controller: Fix another bug that causing system hanging
       [not found]   ` <1251495072-7780-12-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-08-30  0:38     ` Rik van Riel
@ 2009-09-18  3:56     ` Gui Jianfeng
  1 sibling, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-18  3:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
...

>   * If io scheduler has functionality of keeping track of close cooperator, check
>   * with it if it has got a closely co-operating queue.
> @@ -2057,6 +2171,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  {
>  	struct elv_fq_data *efqd = q->elevator->efqd;
>  	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
> +	struct io_group *iog;
>  
>  	if (!elv_nr_busy_ioq(q->elevator))
>  		return NULL;
> @@ -2064,6 +2179,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  	if (ioq == NULL)
>  		goto new_queue;
>  
> +	iog = ioq_to_io_group(ioq);
> +
>  	/*
>  	 * Force dispatch. Continue to dispatch from current queue as long
>  	 * as it has requests.
> @@ -2075,11 +2192,47 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  			goto expire;
>  	}
>  
> +	/* We are waiting for this group to become busy before it expires.*/
> +	if (elv_iog_wait_busy(iog)) {
> +		ioq = NULL;
> +		goto keep_queue;
> +	}
> +
>  	/*
>  	 * The active queue has run out of time, expire it and select new.
>  	 */
> -	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
> -		goto expire;
> +	if ((elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> +	     && !elv_ioq_must_dispatch(ioq)) {
> +		/*
> +		 * Queue has used up its slice. Wait busy is not on otherwise
> +		 * we wouldn't have been here. If this group will be deleted
> +		 * after the queue expiry, then make sure we have once
> +		 * done wait busy on the group in an attempt to make it
> +		 * backlogged.
> +		 *
> +		 * Following check helps in two conditions.
> +		 * - If there are requests dispatched from the queue and
> +		 *   select_ioq() comes before a request completed from the
> +		 *   queue and got a chance to arm any of the idle timers.
> +		 *
> +		 * - If at request completion time slice had not expired and
> +		 *   we armed either an ioq timer or a group timer but when
> +		 *   select_ioq() hits, slice has expired and it will expire
> +		 *   the queue without doing busy wait on group.
> +		 *
> +		 * In similar situations cfq deletes the queue even if the
> +		 * idle timer is armed. That does not impact fairness in a
> +		 * non-hierarchical setup due to weighted slice lengths. But in
> +		 * a hierarchical setup, where group slice lengths are derived
> +		 * from the queue and are not proportional to the group's
> +		 * weight, it harms the fairness of the group.
> +		 */
> +		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {

Hi Vivek,

Here is another bug which causes tasks to hang when accessing a certain disk.
At the moment, the last ioq (whose corresponding cgroup has been removed) is
optimized not to expire until another ioq gets backlogged. Just checking the
"iog_wait_busy_done" flag here is not sufficient, because the idle timer can be
inactive at that moment. In that case the ioq keeps getting serviced all the
time and is never expired, hanging the whole system. This patch adds an extra
check for "iog_wait_busy" to make sure that the idle timer is pending, so this
ioq will be expired once the timer fires.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 40d0eb5..c039ba2 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -3364,7 +3364,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
 		 * harms the fairness of the group.
 		 */
 		slice_expired = 1;
-		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
+		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog) &&
+		    elv_iog_wait_busy(iog)) {
 			ioq = NULL;
 			goto keep_queue;
 		} else
-- 
1.5.4.rc3
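
As a side note, the fixed condition can be read as an illustrative helper (not
part of this patch) that makes the reasoning explicit:

/*
 * Illustrative only: it is safe to keep the active ioq around (instead of
 * expiring it) while wait-busy is still armed, i.e. the group idle timer is
 * pending and will trigger another select_ioq() pass that eventually
 * expires the queue.
 */
static inline int iog_keep_active_ioq(struct io_queue *ioq,
					struct io_group *iog)
{
	return elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog) &&
		elv_iog_wait_busy(iog);
}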

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix another bug that causing system hanging
       [not found]     ` <4AB30508.6010206-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-09-18 14:47       ` Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-09-18 14:47 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 18, 2009 at 11:56:56AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> 
> >   * If io scheduler has functionality of keeping track of close cooperator, check
> >   * with it if it has got a closely co-operating queue.
> > @@ -2057,6 +2171,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
> >  {
> >  	struct elv_fq_data *efqd = q->elevator->efqd;
> >  	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
> > +	struct io_group *iog;
> >  
> >  	if (!elv_nr_busy_ioq(q->elevator))
> >  		return NULL;
> > @@ -2064,6 +2179,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
> >  	if (ioq == NULL)
> >  		goto new_queue;
> >  
> > +	iog = ioq_to_io_group(ioq);
> > +
> >  	/*
> >  	 * Force dispatch. Continue to dispatch from current queue as long
> >  	 * as it has requests.
> > @@ -2075,11 +2192,47 @@ void *elv_select_ioq(struct request_queue *q, int force)
> >  			goto expire;
> >  	}
> >  
> > +	/* We are waiting for this group to become busy before it expires.*/
> > +	if (elv_iog_wait_busy(iog)) {
> > +		ioq = NULL;
> > +		goto keep_queue;
> > +	}
> > +
> >  	/*
> >  	 * The active queue has run out of time, expire it and select new.
> >  	 */
> > -	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
> > -		goto expire;
> > +	if ((elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> > +	     && !elv_ioq_must_dispatch(ioq)) {
> > +		/*
> > +		 * Queue has used up its slice. Wait busy is not on otherwise
> > +		 * we wouldn't have been here. If this group will be deleted
> > +		 * after the queue expiry, then make sure we have once
> > +		 * done wait busy on the group in an attempt to make it
> > +		 * backlogged.
> > +		 *
> > +		 * Following check helps in two conditions.
> > +		 * - If there are requests dispatched from the queue and
> > +		 *   select_ioq() comes before a request completed from the
> > +		 *   queue and got a chance to arm any of the idle timers.
> > +		 *
> > +		 * - If at request completion time slice had not expired and
> > +		 *   we armed either an ioq timer or a group timer but when
> > +		 *   select_ioq() hits, slice has expired and it will expire
> > +		 *   the queue without doing busy wait on group.
> > +		 *
> > +		 * In similar situations cfq deletes the queue even if the
> > +		 * idle timer is armed. That does not impact fairness in a
> > +		 * non-hierarchical setup due to weighted slice lengths. But in
> > +		 * a hierarchical setup, where group slice lengths are derived
> > +		 * from the queue and are not proportional to the group's
> > +		 * weight, it harms the fairness of the group.
> > +		 */
> > +		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
> 
> Hi Vivek,
> 
> Here is another bug which causes tasks to hang when accessing a certain disk.
> At the moment, the last ioq (whose corresponding cgroup has been removed) is
> optimized not to expire until another ioq gets backlogged. Just checking the
> "iog_wait_busy_done" flag here is not sufficient, because the idle timer can
> be inactive at that moment. In that case the ioq keeps getting serviced all
> the time and is never expired, hanging the whole system. This patch adds an
> extra check for "iog_wait_busy" to make sure that the idle timer is pending,
> so this ioq will be expired once the timer fires.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

Good point. I think keeping the single queue around within a child group is
getting complicated.

For the time being I will continue to expire the single ioq of a child group
even if other competing queues are not present (bring back the check of
efqd->root_group->ioq).

Once the rest of the things stabilize, we can revisit this optimization of not
expiring the single queue in child groups as well.

Thanks
Vivek

> ---
>  block/elevator-fq.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 40d0eb5..c039ba2 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -3364,7 +3364,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
>  		 * harms the fairness of the group.
>  		 */
>  		slice_expired = 1;
> -		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
> +		if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog) &&
> +		    elv_iog_wait_busy(iog)) {
>  			ioq = NULL;
>  			goto keep_queue;
>  		} else
> -- 
> 1.5.4.rc3
> 
> 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
       [not found]             ` <20090915033739.GA4054-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-16  0:05               ` Gui Jianfeng
  2009-09-16  2:58               ` Gui Jianfeng
@ 2009-09-24  1:10               ` Gui Jianfeng
  2 siblings, 0 replies; 322+ messages in thread
From: Gui Jianfeng @ 2009-09-24  1:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Currently, we only set this flag when anticipating the next request
(ANTIC_WAIT_NEXT), so make sure we also clear the flag only in that case.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/as-iosched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 5868e72..7a64232 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -728,9 +728,10 @@ static void as_antic_stop(struct as_data *ad)
 	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
 
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
-		if (status == ANTIC_WAIT_NEXT)
+		if (status == ANTIC_WAIT_NEXT) {
 			del_timer(&ad->antic_timer);
-		as_clear_active_asq_wait_request(ad);
+			as_clear_active_asq_wait_request(ad);
+		}
 		ad->antic_status = ANTIC_FINISHED;
 		/* see as_work_handler */
 		kblockd_schedule_work(ad->q, &ad->antic_work);
-- 
1.5.4.rc3
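
For readers following along, this is how the affected branch of
as_antic_stop() ends up looking after the change above (an annotated
restatement of the diff, not new code):

	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
		if (status == ANTIC_WAIT_NEXT) {
			/*
			 * The asq wait_request flag is only ever set on the
			 * ANTIC_WAIT_NEXT path, so clear it only here,
			 * together with the anticipation timer.
			 */
			del_timer(&ad->antic_timer);
			as_clear_active_asq_wait_request(ad);
		}
		ad->antic_status = ANTIC_FINISHED;
		/* see as_work_handler */
		kblockd_schedule_work(ad->q, &ad->antic_work);
	}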

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* [RFC] IO scheduler based IO controller V9
@ 2009-08-28 21:30 Vivek Goyal
  0 siblings, 0 replies; 322+ messages in thread
From: Vivek Goyal @ 2009-08-28 21:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


Hi All,

Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch

Changes from V8
===============
- Implemented bdi like congestion semantics for io group also. Now once an
  io group gets congested, we don't clear the congestion flag until number
  of requests goes below nr_congestion_off.

  This helps in getting rid of Buffered write performance regression we
  were observing with io controller patches.

  Gui, can you please test it and see if this version is better in terms
  of your buffered write tests.

- Moved some of the functions from blk-core.c to elevator-fq.c. This reduces
  CONFIG_GROUP_IOSCHED ifdefs in blk-core.c and code looks little more clean. 

- Fixed issue of add_front where we go left on rb-tree if add_front is
  specified in case of preemption.

- Requeue async ioq after one round of dispatch. This helps in emulating
  CFQ behavior.

- Pulled in v11 of io tracking patches and modified config option so that if
  CONFIG_TRACK_ASYNC_CONTEXT is not enabled, blkio is not compiled in.

- Fixed some block tracepoints which were broken because of per group request
  list changes.

- Fixed some logging messages.

- Got rid of extra call to update_prio as pointed out by Jerome and Gui.

- Merged the fix from Jerome for a crash while changing prio.

- Got rid of redundant slice_start assignment as pointed by Gui.

- Merged an elv_ioq_nr_dispatched() cleanup from Gui.

- Fixed a compilation issue if CONFIG_BLOCK=n.
 
What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight. 

How to solve the problem
=========================

Different people have solved the issue differently. At least there are now
three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of IO rate of a group and
throttles the process in the group if it exceeds the user specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).

So one will setup one or more dm-ioband devices on top of physical/logical
block device, configure the ioband device and pass information like grouping
etc. Now this device will keep track of bios flowing through it and control
the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here I have viewed the problem of IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one can
view Linux IO schedulers as flat, where there is one root group and all the IO
belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I 
have extended it to support group IO scheduling. I also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling code
can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approach. Following are some of the
thoughts.

- IO throttling is a max bandwidth controller and not a proportional one.
  Additionally it provides fairness in terms of amount of IO done (and not in
  terms of disk time as CFQ does).

  Personally, I think that proportional weight controller is useful to more
  people than just max bandwidth controller. In addition, IO scheduler based
  controller can also be enhanced to do max bandwidth control, if need be.

- dm-ioband also provides fairness in terms of amount of IO done, not in terms
  of disk time. So a seeky process can still run away with a lot more disk
  time. It is an interesting question how fairness among groups should be
  viewed and what is more relevant: should fairness be based on the amount of
  IO done or on the amount of disk time consumed, as CFQ does? The IO
  scheduler based controller provides fairness in terms of disk time used.

- IO throttling and dm-ioband are both second level controllers. That is, these
  controllers are implemented in higher layers than io schedulers. So they
  control the IO at higher layer based on group policies and later IO
  schedulers take care of dispatching these bios to disk.

  Implementing a second level controller has the advantage of being able to
  provide bandwidth control even on logical block devices in the IO stack
  which don't have any IO scheduler attached to them. But they can also
  interfere with the IO scheduling policy of the underlying IO scheduler and change
  the effective behavior. Following are some of the issues which I think
  should be visible in second level controller in one form or other.

  Prio with-in group
  ------------------
  A second level controller can potentially interfere with the behavior of
  different prio processes within a group. bios are buffered at the higher
  layer in a single queue, and the release of bios is FIFO and not
  proportionate to the ioprio of the process. This can result in a particular
  prio level not getting its fair share.

  Buffering at a higher layer can delay read requests for more than the slice
  idle period of CFQ (default 8 ms). That means it is possible that we are
  waiting for a request from the queue but it is buffered at the higher layer,
  and then the idle timer will fire. The queue will lose its share, and at the
  same time overall throughput will be impacted as we lost those 8 ms.
  
  Read Vs Write
  -------------
  Writes can overwhelm readers, hence a second level controller's FIFO release
  will run into issues here. If a single queue is maintained then reads
  will suffer large latencies. If there are separate queues for reads and writes
  then it will be hard to decide in what ratio to dispatch reads and writes as
  it is IO scheduler's decision to decide when and how much read/write to
  dispatch. This is another place where higher level controller will not be in
  sync with lower level io scheduler and can change the effective policies of
  underlying io scheduler.

  Fairness in terms of disk time / size of IO
  ---------------------------------------------
  A higher level controller will most likely be limited to providing fairness
  in terms of size of IO done and will find it hard to provide fairness in
  terms of disk time used (as CFQ provides between various prio levels). This
  is because only IO scheduler knows how much disk time a queue has used.

  Not sure how useful it is to have fairness in terms of sectors when CFQ has
  been providing fairness in terms of disk time. So a seeky application will
  still run away with a lot of disk time and bring down the overall throughput
  of the disk more than usual.

  CFQ IO context Issues
  ---------------------
  Buffering at a higher layer means bios are submitted later with the help of
  a worker thread. This changes the io context information at the CFQ layer,
  which assigns the request to the submitting thread. This change of io context
  info again leads to issues of idle timer expiry, a process not getting its
  fair share, and reduced throughput.

  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think a higher level controller will result in reduced overall throughput
  (as compared to io scheduler based io controller) and more seeks with noop,
  deadline and AS.

  The reason being, that it is likely that IO with-in a group will be related
  and will be relatively close as compared to IO across the groups. For example,
  thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
  control, IO from various groups will go into a single queue at lower level
  controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
  G4....) causing more seeks and reduced throughput. (Agreed that merging will
  help up to some extent but still....).

  Instead, in case of a lower level controller, the IO scheduler maintains one
  queue per group, hence there is no interleaving of IO between groups. And if
  IO is related within a group, then we should get a reduced number/amount of
  seeks and higher throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.

- The IO scheduler based controller has the limitation that it works only with
  the bottom most devices in the IO stack, where the IO scheduler is attached.
  The question is how important it is to also control bandwidth at higher
  level logical devices. The actual contention for resources is at the leaf
  block device, so it probably makes sense to do any kind of control there and
  not at the intermediate devices. It probably also means better use of the
  available resources.

  For example, assume a user has created a linear logical device lv0 using
  three underlying disks sda, sdb and sdc (a sketch of such a setup follows
  below). Also assume there are two tasks T1 and T2 in two groups doing IO on
  lv0, and that the group weights are in the ratio 2:1, so T1 should get
  double the BW of T2 on the lv0 device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			    sda sdb  sdc

  Now suppose IO control is done at the lv0 level, T1's IO goes only to sda,
  and T2's IO goes only to sdc. In that case there is no need for resource
  management, as the two IO streams have no contention where it matters.
  Trying to do IO control at the lv0 device would not be an optimal use of
  resources and would bring down overall throughput.
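
  For illustration, one way such an lv0 could be created (the device names are
  just the placeholders from the example above, and lvcreate allocates
  linearly by default):

  --------------------------------------------------------------
  # Linear logical volume spanning sda, sdb and sdc.
  pvcreate /dev/sda /dev/sdb /dev/sdc
  vgcreate vg0 /dev/sda /dev/sdb /dev/sdc
  lvcreate -l 100%FREE -n lv0 vg0
  mkfs.ext3 /dev/vg0/lv0
  --------------------------------------------------------------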

IMHO, an IO scheduler based IO controller is a reasonable approach to solving
the problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently. But I am all ears for alternative approaches and
suggestions on how things can be done better.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Open Issues
===========
- Currently, for async requests like buffered writes, we get the io group
  information from the page instead of the task context. How important is it
  to determine the context from the page?

  Could we instead put all the pdflush threads into a separate group and
  control system wide buffered write bandwidth that way? Any buffered writes
  submitted by a process directly will go to the right group anyway.

  If that is acceptable then we can drop all the code associated with async io
  context, which should simplify the patchset a lot.
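
  A minimal sketch of that idea, assuming the io cgroup hierarchy is mounted
  at /cgroup/bfqio (as in the tests below) and that the writeback threads are
  still named pdflush:

  --------------------------------------------------------------
  # Hypothetical: put all pdflush threads into one group to cap
  # system wide buffered writeback bandwidth.
  mkdir -p /cgroup/bfqio/pdflush
  echo 500 > /cgroup/bfqio/pdflush/io.weight
  for pid in $(pgrep pdflush); do
          echo $pid > /cgroup/bfqio/pdflush/tasks
  done
  --------------------------------------------------------------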

Testing
=======
I have divided the testing results into three sections.

- Latency
- Throughput and Fairness
- Group Fairness

Because I have enhanced CFQ to also do group scheduling, one of the concerns
has been that existing CFQ should not regress, at least in a flat setup. If
one creates groups and puts tasks in them, that is a new environment and some
properties can change, because groups have the additional requirement of
providing isolation.

Environment
==========
A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
 
Latency Testing
++++++++++++++++

Test1: fsync-test with torture test from linus as background writer
------------------------------------------------------------
I looked at the ext3 fsync latency thread and picked the fsync-test from
Theodore Ts'o, with the torture test from Linus as a background writer, to see
what the fsync completion latencies look like. Following are the results.

Vanilla CFQ              IOC                    IOC (with map async)
===========             =================        ====================
fsync time: 0.2515      fsync time: 0.8580      fsync time: 0.0531
fsync time: 0.1082      fsync time: 0.1408      fsync time: 0.8907
fsync time: 0.2106      fsync time: 0.3228      fsync time: 0.2709
fsync time: 0.2591      fsync time: 0.0978      fsync time: 0.3198
fsync time: 0.2776      fsync time: 0.3035      fsync time: 0.0886
fsync time: 0.2530      fsync time: 0.0903      fsync time: 0.3035
fsync time: 0.2271      fsync time: 0.2712      fsync time: 0.0961
fsync time: 0.1057      fsync time: 0.3357      fsync time: 0.1048
fsync time: 0.1699      fsync time: 0.3175      fsync time: 0.2582
fsync time: 0.1923      fsync time: 0.2964      fsync time: 0.0876
fsync time: 0.1805      fsync time: 0.0971      fsync time: 0.2546
fsync time: 0.2944      fsync time: 0.2728      fsync time: 0.3059
fsync time: 0.1420      fsync time: 0.1079      fsync time: 0.2973
fsync time: 0.2650      fsync time: 0.3103      fsync time: 0.2032
fsync time: 0.1581      fsync time: 0.1987      fsync time: 0.2926
fsync time: 0.2656      fsync time: 0.3048      fsync time: 0.1934
fsync time: 0.2666      fsync time: 0.3092      fsync time: 0.2954
fsync time: 0.1272      fsync time: 0.0165      fsync time: 0.2952
fsync time: 0.2655      fsync time: 0.2827      fsync time: 0.2394
fsync time: 0.0147      fsync time: 0.0068      fsync time: 0.0454
fsync time: 0.2296      fsync time: 0.2923      fsync time: 0.2936
fsync time: 0.0069      fsync time: 0.3021      fsync time: 0.0397
fsync time: 0.2668      fsync time: 0.1032      fsync time: 0.2762
fsync time: 0.1932      fsync time: 0.0962      fsync time: 0.2946
fsync time: 0.1895      fsync time: 0.3545      fsync time: 0.0774
fsync time: 0.2577      fsync time: 0.2406      fsync time: 0.3027
fsync time: 0.4935      fsync time: 0.7193      fsync time: 0.2984
fsync time: 0.2804      fsync time: 0.3251      fsync time: 0.1057
fsync time: 0.2685      fsync time: 0.1001      fsync time: 0.3145
fsync time: 0.1946      fsync time: 0.2525      fsync time: 0.2992

IOC--> With IO controller patches applied. CONFIG_TRACK_ASYNC_CONTEXT=n
IOC(map async) --> IO controller patches with CONFIG_TRACK_ASYNC_CONTEXT=y

If CONFIG_TRACK_ASYNC_CONTEXT=y, async requests are mapped to a group based
on the cgroup info stored in the page; otherwise they are mapped to the cgroup
the submitting task belongs to.

Notes:
- It looks like the max fsync time is a bit higher with the IO controller
  patches. Will dig more into it later.
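
For anyone who wants to reproduce a rough version of this measurement without
the fsync-test binary, something along these lines should approximate it (the
probe file path is just an example; it only times small write+fsync cycles):

--------------------------------------------------------------
#!/bin/bash
# Rough stand-in for fsync-test: repeatedly time a 4K write followed
# by an fsync (dd's conv=fsync) while the torture-test writer runs.
for i in $(seq 1 30); do
        /usr/bin/time -f "fsync time: %e" \
                dd if=/dev/zero of=/mnt/sdb/fsync-probe bs=4k count=1 \
                conv=fsync 2>&1 | grep "fsync time"
done
--------------------------------------------------------------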

Test2: read small files with multiple sequential readers (10) running
======================================================================
Took Ingo's small file reader test and ran it while 10 sequential readers
were running.

Vanilla CFQ     IOC (flat)      IOC (10 readers in 10 groups)
0.12 seconds    0.11 seconds    1.62 seconds
0.05 seconds    0.05 seconds    1.18 seconds
0.05 seconds    0.05 seconds    1.17 seconds
0.03 seconds    0.04 seconds    1.18 seconds
1.15 seconds    1.17 seconds    1.29 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.18 seconds    1.18 seconds
1.15 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.15 seconds    1.17 seconds
0.04 seconds    0.04 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.17 seconds
1.18 seconds    1.15 seconds    1.28 seconds
1.18 seconds    1.15 seconds    1.18 seconds
1.17 seconds    1.16 seconds    1.18 seconds
1.17 seconds    1.18 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.15 seconds    1.17 seconds
1.16 seconds    1.15 seconds    1.17 seconds
1.15 seconds    1.15 seconds    1.18 seconds
1.18 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds
1.17 seconds    1.16 seconds    1.17 seconds
1.16 seconds    1.16 seconds    1.17 seconds

In the third column, the 10 readers have been put into 10 groups instead of
running in the root group. The small file reader runs in the root group.

Notes: It looks like read latencies here remain the same as with vanilla CFQ.

Test3: read small files with multiple writers (8) running
==========================================================
Again running the small file reader test with 8 buffered writers of prio 0
to 7 in the background.

Latency results are in seconds. I tried to capture the output with multiple
configurations of the IO controller to see the effect.

Vanilla  IOC     IOC     IOC     IOC    IOC     IOC
        (flat)(groups) (groups) (map)  (map)    (map)
                (f=0)   (f=1)   (flat) (groups) (groups)
                                        (f=0)   (f=1)
0.25    0.03    0.31    0.25    0.29    1.25    0.39
0.27    0.28    0.28    0.30    0.41    0.90    0.80
0.25    0.24    0.23    0.37    0.27    1.17    0.24
0.14    0.14    0.14    0.13    0.15    0.10    1.11
0.14    0.16    0.13    0.16    0.15    0.06    0.58
0.16    0.11    0.15    0.12    0.19    0.05    0.14
0.03    0.17    0.12    0.17    0.04    0.12    0.12
0.13    0.13    0.13    0.14    0.03    0.05    0.05
0.18    0.13    0.17    0.09    0.09    0.05    0.07
0.11    0.18    0.16    0.18    0.14    0.05    0.12
0.28    0.14    0.15    0.15    0.13    0.02    0.04
0.16    0.14    0.14    0.12    0.15    0.00    0.13
0.14    0.13    0.14    0.13    0.13    0.02    0.02
0.13    0.11    0.12    0.14    0.15    0.06    0.01
0.27    0.28    0.32    0.24    0.25    0.01    0.01
0.14    0.15    0.18    0.15    0.13    0.06    0.02
0.15    0.13    0.13    0.13    0.13    0.00    0.04
0.15    0.13    0.15    0.14    0.15    0.01    0.05
0.11    0.17    0.15    0.13    0.13    0.02    0.00
0.17    0.13    0.17    0.12    0.18    0.39    0.01
0.18    0.16    0.14    0.16    0.14    0.89    0.47
0.13    0.13    0.14    0.04    0.12    0.64    0.78
0.16    0.15    0.19    0.11    0.16    0.67    1.17
0.04    0.12    0.14    0.04    0.18    0.67    0.63
0.03    0.13    0.17    0.11    0.15    0.61    0.69
0.15    0.16    0.13    0.14    0.13    0.77    0.66
0.12    0.12    0.15    0.11    0.13    0.92    0.73
0.15    0.12    0.15    0.16    0.13    0.70    0.73
0.11    0.13    0.15    0.10    0.18    0.73    0.82
0.16    0.19    0.15    0.16    0.14    0.71    0.74
0.28    0.05    0.26    0.22    0.17    2.91    0.79
0.13    0.05    0.14    0.14    0.14    0.44    0.65
0.16    0.22    0.18    0.13    0.26    0.31    0.65
0.10    0.13    0.12    0.11    0.16    0.25    0.66
0.13    0.14    0.16    0.15    0.12    0.17    0.76
0.19    0.11    0.12    0.14    0.17    0.20    0.71
0.16    0.15    0.14    0.15    0.11    0.19    0.68
0.13    0.13    0.13    0.13    0.16    0.04    0.78
0.14    0.16    0.15    0.17    0.15    1.20    0.80
0.17    0.13    0.14    0.18    0.14    0.76    0.63

f(0/1)--> refers to the "fairness" tunable. This is a new tunable, part of
          CFQ. If set, we wait for requests from one queue to finish before a
          new queue is scheduled in.

group ---> writers are running in individual groups and not in the root group.
map---> buffered writes are mapped to a group using info stored in the page.

Notes: Except for columns 6 and 7, where writers are in separate groups and
their writes are mapped to the respective group, latencies seem to be fine. I
think the latencies are higher for the last two cases because now the reader
can't preempt the writer.

				root
			       / \  \ \
			      R  G1 G2 G3
				 |  |  |
				 W  W  W
Test4: Random Reader test in presence of 4 sequential readers and 4 buffered
       writers
============================================================================
Used fio this time to run one random reader and see how it fares in the
presence of 4 sequential readers and 4 writers.

I have just pasted the output of random reader from fio.

Vanilla Kernel, Three runs
--------------------------
read : io=20,512KiB, bw=349KiB/s, iops=10, runt= 60075msec
clat (usec): min=944, max=2,675K, avg=93715.04, stdev=305815.90

read : io=13,696KiB, bw=233KiB/s, iops=7, runt= 60035msec
clat (msec): min=2, max=1,812, avg=140.26, stdev=382.55

read : io=13,824KiB, bw=235KiB/s, iops=7, runt= 60185msec
clat (usec): min=766, max=2,025K, avg=139310.55, stdev=383647.54

IO controller kernel, Three runs
--------------------------------
read : io=10,304KiB, bw=175KiB/s, iops=5, runt= 60083msec
clat (msec): min=2, max=2,654, avg=186.59, stdev=524.08

read : io=10,176KiB, bw=173KiB/s, iops=5, runt= 60054msec
clat (usec): min=792, max=2,567K, avg=188841.70, stdev=517154.75

read : io=11,040KiB, bw=188KiB/s, iops=5, runt= 60003msec
clat (usec): min=779, max=2,625K, avg=173915.56, stdev=508118.60

Notes:
- Looks like vanilla CFQ gives a bit more disk access to the random reader.
  Will dig into it.

Throughput and Fairness
+++++++++++++++++++++++
Test5: Bandwidth distribution between 4 sequential readers and 4 buffered
       writers
==========================================================================
Used fio to launch 4 sequential readers and 4 buffered writers and watched
how BW is distributed.

Vanilla kernel, Three sets
--------------------------
read : io=962MiB, bw=16,818KiB/s, iops=513, runt= 60008msec
read : io=969MiB, bw=16,920KiB/s, iops=516, runt= 60077msec
read : io=978MiB, bw=17,063KiB/s, iops=520, runt= 60096msec
read : io=922MiB, bw=16,106KiB/s, iops=491, runt= 60057msec
write: io=235MiB, bw=4,099KiB/s, iops=125, runt= 60049msec
write: io=226MiB, bw=3,944KiB/s, iops=120, runt= 60049msec
write: io=215MiB, bw=3,747KiB/s, iops=114, runt= 60049msec
write: io=207MiB, bw=3,606KiB/s, iops=110, runt= 60049msec
READ: io=3,832MiB, aggrb=66,868KiB/s, minb=16,106KiB/s, maxb=17,063KiB/s,
mint=60008msec, maxt=60096msec
WRITE: io=882MiB, aggrb=15,398KiB/s, minb=3,606KiB/s, maxb=4,099KiB/s,
mint=60049msec, maxt=60049msec

read : io=1,002MiB, bw=17,513KiB/s, iops=534, runt= 60020msec
read : io=979MiB, bw=17,085KiB/s, iops=521, runt= 60080msec
read : io=953MiB, bw=16,637KiB/s, iops=507, runt= 60092msec
read : io=920MiB, bw=16,057KiB/s, iops=490, runt= 60108msec
write: io=215MiB, bw=3,560KiB/s, iops=108, runt= 63289msec
write: io=136MiB, bw=2,361KiB/s, iops=72, runt= 60502msec
write: io=127MiB, bw=2,101KiB/s, iops=64, runt= 63289msec
write: io=233MiB, bw=3,852KiB/s, iops=117, runt= 63289msec
READ: io=3,855MiB, aggrb=67,256KiB/s, minb=16,057KiB/s, maxb=17,513KiB/s,
mint=60020msec, maxt=60108msec
WRITE: io=711MiB, aggrb=11,771KiB/s, minb=2,101KiB/s, maxb=3,852KiB/s,
mint=60502msec, maxt=63289msec

read : io=985MiB, bw=17,179KiB/s, iops=524, runt= 60149msec
read : io=974MiB, bw=17,025KiB/s, iops=519, runt= 60002msec
read : io=962MiB, bw=16,772KiB/s, iops=511, runt= 60170msec
read : io=932MiB, bw=16,280KiB/s, iops=496, runt= 60057msec
write: io=177MiB, bw=2,933KiB/s, iops=89, runt= 63094msec
write: io=152MiB, bw=2,637KiB/s, iops=80, runt= 60323msec
write: io=240MiB, bw=3,983KiB/s, iops=121, runt= 63094msec
write: io=147MiB, bw=2,439KiB/s, iops=74, runt= 63094msec
READ: io=3,855MiB, aggrb=67,174KiB/s, minb=16,280KiB/s, maxb=17,179KiB/s,
mint=60002msec, maxt=60170msec
WRITE: io=715MiB, aggrb=11,877KiB/s, minb=2,439KiB/s, maxb=3,983KiB/s,
mint=60323msec, maxt=63094msec

IO controller kernel three sets
-------------------------------
read : io=944MiB, bw=16,483KiB/s, iops=503, runt= 60055msec
read : io=941MiB, bw=16,433KiB/s, iops=501, runt= 60073msec
read : io=900MiB, bw=15,713KiB/s, iops=479, runt= 60040msec
read : io=866MiB, bw=15,112KiB/s, iops=461, runt= 60086msec
write: io=244MiB, bw=4,262KiB/s, iops=130, runt= 60040msec
write: io=177MiB, bw=3,085KiB/s, iops=94, runt= 60042msec
write: io=158MiB, bw=2,758KiB/s, iops=84, runt= 60041msec
write: io=180MiB, bw=3,137KiB/s, iops=95, runt= 60040msec
READ: io=3,651MiB, aggrb=63,718KiB/s, minb=15,112KiB/s, maxb=16,483KiB/s,
mint=60040msec, maxt=60086msec
WRITE: io=758MiB, aggrb=13,243KiB/s, minb=2,758KiB/s, maxb=4,262KiB/s,
mint=60040msec, maxt=60042msec

read : io=960MiB, bw=16,734KiB/s, iops=510, runt= 60137msec
read : io=917MiB, bw=16,001KiB/s, iops=488, runt= 60122msec
read : io=897MiB, bw=15,683KiB/s, iops=478, runt= 60004msec
read : io=908MiB, bw=15,824KiB/s, iops=482, runt= 60149msec
write: io=209MiB, bw=3,563KiB/s, iops=108, runt= 61400msec
write: io=177MiB, bw=3,030KiB/s, iops=92, runt= 61400msec
write: io=200MiB, bw=3,409KiB/s, iops=104, runt= 61400msec
write: io=204MiB, bw=3,489KiB/s, iops=106, runt= 61400msec
READ: io=3,682MiB, aggrb=64,194KiB/s, minb=15,683KiB/s, maxb=16,734KiB/s,
mint=60004msec, maxt=60149msec
WRITE: io=790MiB, aggrb=13,492KiB/s, minb=3,030KiB/s, maxb=3,563KiB/s,
mint=61400msec, maxt=61400msec

read : io=968MiB, bw=16,867KiB/s, iops=514, runt= 60158msec
read : io=925MiB, bw=16,135KiB/s, iops=492, runt= 60142msec
read : io=875MiB, bw=15,286KiB/s, iops=466, runt= 60003msec
read : io=872MiB, bw=15,221KiB/s, iops=464, runt= 60049msec
write: io=213MiB, bw=3,720KiB/s, iops=113, runt= 60162msec
write: io=203MiB, bw=3,536KiB/s, iops=107, runt= 60163msec
write: io=208MiB, bw=3,620KiB/s, iops=110, runt= 60162msec
write: io=203MiB, bw=3,538KiB/s, iops=107, runt= 60163msec
READ: io=3,640MiB, aggrb=63,439KiB/s, minb=15,221KiB/s, maxb=16,867KiB/s,
mint=60003msec, maxt=60158msec
WRITE: io=827MiB, aggrb=14,415KiB/s, minb=3,536KiB/s, maxb=3,720KiB/s,
mint=60162msec, maxt=60163msec

Notes: It looks like vanilla CFQ favors readers over writers a bit more than
       the io controller CFQ does. Will dig into it.
	 
Test6: Bandwidth distribution between readers of diff prio
==========================================================
Using fio, ran 8 readers of prio 0 to 7, let them run for 30 seconds, and
watched the overall throughput and who got how much IO done.
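
For reference, an equivalent workload can be sketched without fio, using
ionice and dd (this is only an approximation of the fio job I used; the read
files are hypothetical, pre-created on the test disk):

--------------------------------------------------------------
# 8 sequential readers, one per best-effort ioprio level 0-7,
# reading for roughly 30 seconds.
for p in $(seq 0 7); do
        ionice -c 2 -n $p dd if=/mnt/$BLOCKDEV/readfile$p of=/dev/null &
done
sleep 30
killall dd
--------------------------------------------------------------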

Vanilla kernel, Three sets
---------------------------
read : io=454MiB, bw=15,865KiB/s, iops=484, runt= 30004msec
read : io=382MiB, bw=13,330KiB/s, iops=406, runt= 30086msec
read : io=325MiB, bw=11,330KiB/s, iops=345, runt= 30074msec
read : io=294MiB, bw=10,253KiB/s, iops=312, runt= 30062msec
read : io=238MiB, bw=8,321KiB/s, iops=253, runt= 30048msec
read : io=145MiB, bw=5,061KiB/s, iops=154, runt= 30032msec
read : io=99MiB, bw=3,456KiB/s, iops=105, runt= 30021msec
read : io=67,040KiB, bw=2,280KiB/s, iops=69, runt= 30108msec
READ: io=2,003MiB, aggrb=69,767KiB/s, minb=2,280KiB/s, maxb=15,865KiB/s,
mint=30004msec, maxt=30108msec

read : io=450MiB, bw=15,727KiB/s, iops=479, runt= 30001msec
read : io=371MiB, bw=12,966KiB/s, iops=395, runt= 30040msec
read : io=325MiB, bw=11,321KiB/s, iops=345, runt= 30099msec
read : io=296MiB, bw=10,332KiB/s, iops=315, runt= 30086msec
read : io=238MiB, bw=8,319KiB/s, iops=253, runt= 30056msec
read : io=152MiB, bw=5,290KiB/s, iops=161, runt= 30070msec
read : io=100MiB, bw=3,483KiB/s, iops=106, runt= 30020msec
read : io=68,832KiB, bw=2,340KiB/s, iops=71, runt= 30118msec
READ: io=2,000MiB, aggrb=69,631KiB/s, minb=2,340KiB/s, maxb=15,727KiB/s,
mint=30001msec, maxt=30118msec

read : io=450MiB, bw=15,691KiB/s, iops=478, runt= 30068msec
read : io=369MiB, bw=12,882KiB/s, iops=393, runt= 30032msec
read : io=364MiB, bw=12,732KiB/s, iops=388, runt= 30015msec
read : io=283MiB, bw=9,889KiB/s, iops=301, runt= 30002msec
read : io=228MiB, bw=7,935KiB/s, iops=242, runt= 30091msec
read : io=144MiB, bw=5,018KiB/s, iops=153, runt= 30103msec
read : io=97,760KiB, bw=3,327KiB/s, iops=101, runt= 30083msec
read : io=66,784KiB, bw=2,276KiB/s, iops=69, runt= 30046msec
READ: io=1,999MiB, aggrb=69,625KiB/s, minb=2,276KiB/s, maxb=15,691KiB/s,
mint=30002msec, maxt=30103msec

IO controller kernel, Three sets
--------------------------------
read : io=404MiB, bw=14,103KiB/s, iops=430, runt= 30072msec
read : io=344MiB, bw=11,999KiB/s, iops=366, runt= 30035msec
read : io=294MiB, bw=10,257KiB/s, iops=313, runt= 30052msec
read : io=254MiB, bw=8,888KiB/s, iops=271, runt= 30021msec
read : io=238MiB, bw=8,311KiB/s, iops=253, runt= 30086msec
read : io=177MiB, bw=6,202KiB/s, iops=189, runt= 30001msec
read : io=158MiB, bw=5,517KiB/s, iops=168, runt= 30118msec
read : io=99MiB, bw=3,464KiB/s, iops=105, runt= 30107msec
READ: io=1,971MiB, aggrb=68,604KiB/s, minb=3,464KiB/s, maxb=14,103KiB/s,
mint=30001msec, maxt=30118msec

read : io=375MiB, bw=13,066KiB/s, iops=398, runt= 30110msec
read : io=326MiB, bw=11,409KiB/s, iops=348, runt= 30003msec
read : io=308MiB, bw=10,758KiB/s, iops=328, runt= 30066msec
read : io=256MiB, bw=8,937KiB/s, iops=272, runt= 30091msec
read : io=232MiB, bw=8,088KiB/s, iops=246, runt= 30041msec
read : io=192MiB, bw=6,695KiB/s, iops=204, runt= 30077msec
read : io=144MiB, bw=5,014KiB/s, iops=153, runt= 30051msec
read : io=96,224KiB, bw=3,281KiB/s, iops=100, runt= 30026msec
READ: io=1,928MiB, aggrb=67,145KiB/s, minb=3,281KiB/s, maxb=13,066KiB/s,
mint=30003msec, maxt=30110msec

read : io=405MiB, bw=14,162KiB/s, iops=432, runt= 30021msec
read : io=354MiB, bw=12,386KiB/s, iops=378, runt= 30007msec
read : io=303MiB, bw=10,567KiB/s, iops=322, runt= 30062msec
read : io=261MiB, bw=9,126KiB/s, iops=278, runt= 30040msec
read : io=228MiB, bw=7,946KiB/s, iops=242, runt= 30048msec
read : io=178MiB, bw=6,222KiB/s, iops=189, runt= 30074msec
read : io=152MiB, bw=5,286KiB/s, iops=161, runt= 30093msec
read : io=99MiB, bw=3,446KiB/s, iops=105, runt= 30110msec
READ: io=1,981MiB, aggrb=68,996KiB/s, minb=3,446KiB/s, maxb=14,162KiB/s,
mint=30007msec, maxt=30110msec

Notes:
- It looks like overall throughput is 1-3% less in case of the io controller.
- Bandwidth distribution between the various prio levels has changed a bit.
  CFQ seems to use a 100ms slice length for prio 4; the slice then increases
  by 20% for each step to a higher prio and decreases by 20% for each step to
  a lower prio. So the IO controller does not seem to be doing too badly at
  meeting that distribution.
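
As a back-of-the-envelope check of that claim, here are the slice lengths
implied by the description above (derived from the text, not lifted from CFQ
code):

--------------------------------------------------------------
# Expected CFQ sync slice per ioprio level: 100ms at prio 4,
# stepping by 20ms (20% of the base) per prio level.
base=100
for prio in $(seq 0 7); do
        echo "prio $prio: expected slice $((base + 20 * (4 - prio))) ms"
done
--------------------------------------------------------------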

Group Fairness
+++++++++++++++
Test7 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two virtual
machines into two different cgroups of weight 1000 and 500 respectively. The
virtual machines created ext3 file systems on the partitions exported from the
host and did buffered writes. The host sees these writes as synchronous, and
the virtual machine with the higher weight gets double the disk time of the
virtual machine with the lower weight. Used the deadline scheduler in this
test case.

Some more details about the configuration are in the documentation patch.
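
A minimal sketch of the cgroup side of this setup (the qemu PID variables and
the /cgroup/bfqio mount point are assumptions; the exact steps are in the
documentation patch):

--------------------------------------------------------------
# Two groups with weights 1000 and 500, one virtual machine in each.
mkdir /cgroup/bfqio/vm1 /cgroup/bfqio/vm2
echo 1000 > /cgroup/bfqio/vm1/io.weight
echo 500 > /cgroup/bfqio/vm2/io.weight
echo $QEMU_PID_VM1 > /cgroup/bfqio/vm1/tasks   # hypothetical PIDs
echo $QEMU_PID_VM2 > /cgroup/bfqio/vm2/tasks
echo deadline > /sys/block/$BLOCKDEV/queue/scheduler
--------------------------------------------------------------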

Test8 (Fairness for synchronous reads)
======================================
- Two dd readers in two cgroups with cgroup weights 1000 and 500. Ran two "dd"
  commands in those cgroups (with the CFQ scheduler and
  /sys/block/<device>/queue/iosched/fairness = 1).

  The higher weight dd finishes first, and at that point my script reads the
  cgroup files io.disk_time and io.disk_sectors for both groups and displays
  the results.

  dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
  dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

  group1 time=8:16 2452 group1 sectors=8:16 457856
  group2 time=8:16 1317 group2 sectors=8:16 247008

  234179072 bytes (234 MB) copied, 3.90912 s, 59.9 MB/s
  234179072 bytes (234 MB) copied, 5.15548 s, 45.4 MB/s

The first two fields in the time and sectors statistics represent the major
and minor number of the device. The third field represents disk time in
milliseconds (for io.disk_time) and the number of sectors transferred (for
io.disk_sectors) respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read using the io.disk_time and
io.disk_sectors files in the cgroup. More about it in the documentation file.
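
For completeness, a minimal sketch of the kind of script used here (it follows
the conventions of the other scripts in this mail; it is not the exact script):

--------------------------------------------------------------
#!/bin/bash
# Run the two weighted dd readers and dump io.disk_time /
# io.disk_sectors once the higher weight reader finishes.
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
PID1=$!

echo $$ > /cgroup/bfqio/test2/tasks
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

wait $PID1
for grp in test1 test2; do
        echo "$grp time=$(cat /cgroup/bfqio/$grp/io.disk_time)" \
             "$grp sectors=$(cat /cgroup/bfqio/$grp/io.disk_sectors)"
done
--------------------------------------------------------------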

Test9 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. The IO controller can provide isolation between readers and
buffered (async) writers.

First I ran the test without the io controller to see the severity of the
issue. I ran a hostile writer and then, after 10 seconds, started a reader and
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test whether the io controller can provide isolation
between readers and writers with noop. I created two cgroups of weight 1000
each, put the reader in group1 and the writer in group2, and ran the test
again. Upon completion of the reader, my script reads the io.disk_time and
io.disk_sectors cgroup files to get an estimate of how much disk time each
group got and how many sectors of IO each group did.

For more accurate accounting of disk time for buffered writes with queuing
hardware, I had to set /sys/block/<disk>/queue/iosched/fairness to "1".

sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo  1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------

Results
=======
268435456 bytes (268 MB) copied, 6.92248 s, 38.8 MB/s

group1 time=8:16 3185 group1 sectors=8:16 524824
group2 time=8:16 3190 group2 sectors=8:16 503848

Note, the reader now finishes in much less time, and both group1 and group2
got almost 3 seconds of disk time. Hence the io controller provides isolation
from buffered writes.

Test10 (AIO)
===========

AIO reads
-----------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
--output=/mnt/$BLOCKDEV/fio1/test1.log \
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
--output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the test1 and
test2 cgroup files to determine how much disk time each group got until the
first fio job finished.

Results
------

test1 statistics: time=8:16 17955   sectors=8:16 1049656 dq=8:16 2
test2 statistics: time=8:16 9217   sectors=8:16 602592 dq=8:16 1

The above shows that by the time the first fio job (higher weight) finished,
group test1 had got 17955 ms of disk time and group test2 had got 9217 ms of
disk time. Similarly, the statistics for the number of sectors transferred are
also shown.

Note that the disk time given to group test1 is almost double the group test2
disk time.

AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
--output=/mnt/$BLOCKDEV/fio1/test1.log \
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the test1 and
test2 cgroup files to determine how much disk time each group got until the
first fio job finished.

Following are the results.

test1 statistics: time=8:16 25452   sectors=8:16 1049664 dq=8:16 2
test2 statistics: time=8:16 12939   sectors=8:16 532184 dq=8:16 4

The above shows that by the time the first fio job (higher weight) finished,
group test1 had got almost double the disk time of group test2.

Test11 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache), and possibly in the file
system layer as well (btrfs, xfs etc), and are dispatched to the lower layers
not necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as their input file and
writing out huge files. Very soon we cross vm_dirty_ratio and a dd thread is
forced to write out some pages to disk before more pages can be dirtied. But
it is not necessarily the dirty pages of that same thread which get picked.
The writeout can very well pick the inode of the lower weight dd thread. So
effectively the higher weight dd ends up doing writeouts of the lower weight
dd's pages and we don't see service differentiation.
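
Concretely, that test case looks roughly like this (weights and file names are
placeholders):

--------------------------------------------------------------
# Two buffered writers, one per weighted group. Since writeback is
# not proportional, service differentiation is often not visible.
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

echo $$ > /cgroup/bfqio/test1/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/bigfile1 bs=4K count=2097152 &

echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/bigfile2 bs=4K count=2097152 &
wait
--------------------------------------------------------------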

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its queue
continuously backlogged. In my testing there are many 0.2 to 0.8 second
intervals where the higher weight queue is empty, and in that duration the
lower weight queue gets a lot of work done, giving the impression that there
was no service differentiation.

In summary, from the IO controller's point of view, async write support is
there. But because the page cache has not been designed so that a higher
prio/weight writer can do more writeout than a lower prio/weight writer,
getting service differentiation is hard; it is visible in some cases and not
in others.

Previous versions of the patches were posted here.
------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204

Thanks
Vivek
