linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] cfq + io priorities
@ 2003-11-10  1:49 Albert Cahalan
  2003-11-10 10:19 ` Herbert Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Albert Cahalan @ 2003-11-10  1:49 UTC (permalink / raw)
  To: linux-kernel mailing list; +Cc: axboe, guichaz

Jens Axboe writes:
> On Sun, Nov 09 2003, Guillaume Chazarain wrote:

>> A process has an assigned io nice level, anywhere
>> from 0 to 20. Both of
>>
>> OK, I ask THE question : why not using the normal
>> nice level, via current->static_prio ? This way,
>> cdrecord would be RT even in IO, and nice -19
>> updatedb would have a minimal impact on the system.
>
> I don't want to tie io priorities to cpu
> priorities, that's a design decision.

Sure, but do it in a way that's friendly to
all the apps and admins that only know "nice".

nice_cpu   sets CPU niceness
nice_net   sets net niceness
nice_disk  sets disk niceness
...
nice       sets all niceness values at once
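
(So, for example, a plain "nice -n 10 updatedb" would bump every one of
the per-resource values at once, while a hypothetical "nice_disk -n 15"
would only touch the disk one; the command names above are only a
sketch of the idea.)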

>>> these end values are "special" - 0 means the process
>>> is only allowed to do io if the disk is idle, and 20
>>> means the process io is considered
>>
>> So a process with ioprio == 0 can be forever
>> starved. As it's not
>
> Yes
>
>> done this way for nice -19 tasks (unlike FreeBSD),
>> wouldn't it be safer to give a very long deadline
>> to ioprio == 0 requests ?
>
> ioprio == 0 means idle IO. It follows from that that
> you can risk infinite starvation if other io is happening.
> Otherwise it would not be idle io :-)
>
> CFQ doesn't assign request deadlines. That would
> be another way of handling starvation.

Keeping IO niceness as similar to CPU niceness as
you can would be very good for admins.




* Re: [PATCH] cfq + io priorities
  2003-11-10  1:49 [PATCH] cfq + io priorities Albert Cahalan
@ 2003-11-10 10:19 ` Herbert Xu
  2003-11-10 10:36   ` Ali Magik Rama
  2003-11-10 13:07   ` [PATCH] cfq + io priorities Albert Cahalan
  0 siblings, 2 replies; 23+ messages in thread
From: Herbert Xu @ 2003-11-10 10:19 UTC (permalink / raw)
  To: Albert Cahalan, linux-kernel

Albert Cahalan <albert@users.sf.net> wrote:
> 
> Sure, but do it in a way that's friendly to
> all the apps and admins that only know "nice".
> 
> nice_cpu   sets CPU niceness
> nice_net   sets net niceness
> nice_disk  sets disk niceness
> ...
> nice       sets all niceness values at once

That's a user space problem.  No matter what Jens does, you can always
make nice(1) do what you said.
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Ali Magik
  2003-11-10 10:19 ` Herbert Xu
@ 2003-11-10 10:36   ` Rama
  2003-11-10 13:07   ` [PATCH] cfq + io priorities Albert Cahalan
  1 sibling, 0 replies; 23+ messages in thread
From: Rama @ 2003-11-10 10:36 UTC (permalink / raw)
  Cc: linux-kernel

hello,


I have read the past posts in the archive, but I still cannot figure out
how I can use the ALi Magik IDE with DMA in 2.4.22.
When I enable it via "hdparm" the system freezes shortly afterwards.

In lilo I have added ide1=autotune hda=autotune to the append line; I have
read somewhere that this can help with detection of the hard disk.

Is there somewhere a patch for 2.4.22 that I can use for this problem?

Thanks for the help!

best regards



* Re: [PATCH] cfq + io priorities
  2003-11-10 10:19 ` Herbert Xu
  2003-11-10 10:36   ` Ali Magik Rama
@ 2003-11-10 13:07   ` Albert Cahalan
  2003-11-10 13:31     ` P
  2003-11-11 17:46     ` Toon van der Pas
  1 sibling, 2 replies; 23+ messages in thread
From: Albert Cahalan @ 2003-11-10 13:07 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-kernel mailing list

On Mon, 2003-11-10 at 05:19, Herbert Xu wrote:
> Albert Cahalan <albert@users.sf.net> wrote:
> > 
> > Sure, but do it in a way that's friendly to
> > all the apps and admins that only know "nice".
> > 
> > nice_cpu   sets CPU niceness
> > nice_net   sets net niceness
> > nice_disk  sets disk niceness
> > ...
> > nice       sets all niceness values at once
> 
> That's a user space problem.  No matter what Jens
> does, you can always make nice(1) do what you said.

It's not just the nice command. There's a syscall
interface, you know, and lots of apps use it.

#include <unistd.h>
int nice(int inc);

Are you planning to hack ALL those apps? Will you
convince BSD-centric developers to include
this Linux-specific change?

Besides, the kernel load average was changed to
include processes waiting for IO. It just plain
makes sense to mix CPU usage with IO usage by
default. Wanting different niceness for CPU
and IO is a really unusual thing.




* Re: [PATCH] cfq + io priorities
  2003-11-10 13:07   ` [PATCH] cfq + io priorities Albert Cahalan
@ 2003-11-10 13:31     ` P
  2003-11-10 13:37       ` Jens Axboe
  2003-11-11 17:46     ` Toon van der Pas
  1 sibling, 1 reply; 23+ messages in thread
From: P @ 2003-11-10 13:31 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Herbert Xu, linux-kernel mailing list, axboe

Albert Cahalan wrote:
> Besides, the kernel load average was changed to
> include processes waiting for IO. It just plain
> makes sense to mix CPU usage with IO usage by
> default. Wanting different niceness for CPU
> and IO is a really unusual thing.

I strongly agree. Of course it would be
nice/necessary to have separate nice values,
but setting the global one should set the
underlying ones (cpu, disk, ...) also.

Pádraig.



* Re: [PATCH] cfq + io priorities
  2003-11-10 13:31     ` P
@ 2003-11-10 13:37       ` Jens Axboe
  2003-11-10 13:57         ` P
  2003-11-10 23:52         ` Albert Cahalan
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2003-11-10 13:37 UTC (permalink / raw)
  To: P; +Cc: Albert Cahalan, Herbert Xu, linux-kernel mailing list

On Mon, Nov 10 2003, P@draigBrady.com wrote:
> Albert Cahalan wrote:
> >Besides, the kernel load average was changed to
> >include processes waiting for IO. It just plain
> >makes sense to mix CPU usage with IO usage by
> >default. Wanting different niceness for CPU
> >and IO is a really unusual thing.
> 
> I strongly agree. Of course it would be
>>nice/necessary to have separate nice values,
> but setting the global one should set the
> underlying ones (cpu, disk, ...) also.

Global one? nice is CPU in Linux, period. ionice is io priority. I'm not
going to change this. So Albert and you can agree as much as you want;
unless you have some heavier arguments it's not going to help one bit.

-- 
Jens Axboe



* Re: [PATCH] cfq + io priorities
  2003-11-10 13:37       ` Jens Axboe
@ 2003-11-10 13:57         ` P
  2003-11-10 23:52         ` Albert Cahalan
  1 sibling, 0 replies; 23+ messages in thread
From: P @ 2003-11-10 13:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Albert Cahalan, Herbert Xu, linux-kernel mailing list

Jens Axboe wrote:
> On Mon, Nov 10 2003, P@draigBrady.com wrote:
> 
>>Albert Cahalan wrote:
>>
>>>Besides, the kernel load average was changed to
>>>include processes waiting for IO. It just plain
>>>makes sense to mix CPU usage with IO usage by
>>>default. Wanting different niceness for CPU
>>>and IO is a really unusual thing.
>>
>>I strongly agree. Of course it would be
>>>nice/necessary to have separate nice values,
>>but setting the global one should set the
>>underlying ones (cpu, disk, ...) also.
> 
> Global one? nice is CPU in Linux, period.

Currently this is what it actually does.
But functionally, a high nice value (to me at least)
means this process is low priority on the system =>
other processes get more system resources, and
the fact that this doesn't apply to IO until
now is just a defect. Now I can see some advantage
to splitting the tunables, but not if it requires
a new interface to turn this functionality on;
I would like the new interface to turn it off.
I.e. I would like:

nice
     cpu
     IO

whereas you would like:

really_nice
     nice
     ionice

> ionice is io priority. I'm not
> going to change this. So Albert and you can agree as much as you want,
> unless you have some heavier arguments it's not going to help one bit.

fair enough.

Pádraig.



* Re: [PATCH] cfq + io priorities
  2003-11-10 13:37       ` Jens Axboe
  2003-11-10 13:57         ` P
@ 2003-11-10 23:52         ` Albert Cahalan
  2003-11-18  9:27           ` Pavel Machek
  1 sibling, 1 reply; 23+ messages in thread
From: Albert Cahalan @ 2003-11-10 23:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: P, Albert Cahalan, Herbert Xu, linux-kernel mailing list

On Mon, 2003-11-10 at 08:37, Jens Axboe wrote:
> On Mon, Nov 10 2003, P@draigBrady.com wrote:
> > Albert Cahalan wrote:
> > >Besides, the kernel load average was changed to
> > >include processes waiting for IO. It just plain
> > >makes sense to mix CPU usage with IO usage by
> > >default. Wanting different niceness for CPU
> > >and IO is a really unusual thing.
> > 
> > I strongly agree. Of course it would be
> > > nice/necessary to have separate nice values,
> > but setting the global one should set the
> > underlying ones (cpu, disk, ...) also.
> 
> Global one? nice is CPU in Linux, period.
> ionice is io priority. I'm not going to
> change this. So Albert and you can agree
> as much as you want, unless you have some
> heavier arguments it's not going to help
> one bit.

In other words, you're ignoring at least 3
people because you can. Coder's right, etc.

FWIW, here's a heavier argument.

I studied the UNIX spec today. I find nothing
that says the "nice" value is only for %CPU.

Your objection has been raised before, when
Linus changed the load average computation
to include processes waiting for IO. Linus
argued that users were really interested in
the feel of the system, no matter what the
source of load. That applies here as well.

In other words, it's a bug that some sources
of load aren't handled (viewed, limited, etc.)
by the normal mechanisms.

I have at least 27 binaries on my system that
use nice() in an attempt to reduce system load
(as in xlock) or increase their ability to
get things done. I can't think of a case in
which a matching IO priority wouldn't be desired.
It sure sounds painful to modify all those
binaries to do what is really wanted, especially
for the ones that aren't Linux-specific and
managed with autoconf.

I suppose the nice() wrapper in glibc could
be modified... but that's kind of silly when
the kernel is getting modified anyway.
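
Just to illustrate, a rough userspace sketch of what I mean (not the
real glibc code; the syscall number is the i386 one from Jens' patch,
where 0 is idle, 10 is the default and 20 is realtime, and the
nice-to-ioprio mapping is just something I made up):

#include <unistd.h>
#include <sys/resource.h>
#include <sys/syscall.h>

#ifndef __NR_ioprio_set
#define __NR_ioprio_set 274	/* i386 value from the patch */
#endif

/* nice() lookalike that also nudges the io priority */
int nice_all(int inc)
{
	int prio = getpriority(PRIO_PROCESS, 0) + inc;	/* -20..19 */
	int ioprio;

	if (setpriority(PRIO_PROCESS, 0, prio) < 0)
		return -1;

	/* map the -20..19 cpu nice range onto the 1..19 io range,
	 * leaving 0 (idle) and 20 (rt) for explicit requests only */
	ioprio = 10 - prio / 2;
	if (ioprio < 1)
		ioprio = 1;
	if (ioprio > 19)
		ioprio = 19;
	syscall(__NR_ioprio_set, ioprio);

	return prio;
}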




* Re: [PATCH] cfq + io priorities
  2003-11-10 13:07   ` [PATCH] cfq + io priorities Albert Cahalan
  2003-11-10 13:31     ` P
@ 2003-11-11 17:46     ` Toon van der Pas
  1 sibling, 0 replies; 23+ messages in thread
From: Toon van der Pas @ 2003-11-11 17:46 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel mailing list

On Mon, Nov 10, 2003 at 08:07:54AM -0500, Albert Cahalan wrote:
> On Mon, 2003-11-10 at 05:19, Herbert Xu wrote:
> > Albert Cahalan <albert@users.sf.net> wrote:
> > > 
> > > Sure, but do it in a way that's friendly to
> > > all the apps and admins that only know "nice".
> > > 
> > > nice_cpu   sets CPU niceness
> > > nice_net   sets net niceness
> > > nice_disk  sets disk niceness
> > > ...
> > > nice       sets all niceness values at once
> > 
> > That's a user space problem.  No matter what Jens
> > does, you can always make nice(1) do what you said.
> 
> It's not just the nice command. There's a syscall
> interface you know, and lots of apps use it.
> 
> #include <unistd.h>
> int nice(int inc);
> 
> You planning to hack ALL those apps? You'll
> convince BSD-centric developers to include
> this Linux-specific change?

Yes, I really would like all those apps to be hacked.
Because I want to REMOVE those nice() calls, not because I want to
add another piece of functionality that doesn't belong in apps in the
first place.

My opinion is that it's the system administrator's job to assign
nice-values to processes. Apps that do that themselves make the task
of the admin harder, if not impossible.

Regards,
Toon.


* Re: [PATCH] cfq + io priorities
  2003-11-10 23:52         ` Albert Cahalan
@ 2003-11-18  9:27           ` Pavel Machek
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Machek @ 2003-11-18  9:27 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Jens Axboe, P, Albert Cahalan, Herbert Xu, linux-kernel mailing list

Hi!

> I suppose the nice() wrapper in glibc could
> be modified... but that's kind of silly when
> the kernel is getting modified anyway.

Perhaps modifying glibc is the right way after all. We do not want to
have nice(), ionice() and then cpunice() as a kernel interface.

OTOH it would be good to keep ionice() on the same scale as other
"nice" values so that a "do-it-all" interface is easier.

								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: [PATCH] cfq + io priorities
  2003-11-18 13:32         ` Jens Axboe
@ 2003-11-18 13:38           ` Pavel Machek
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Machek @ 2003-11-18 13:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Guillaume Chazarain, Linux Kernel

Hi!

> > At least idle class can not be used to hold important semaphore
> > forever (even low-priority processes receive enough time not to hold
> > important semaphores too long)... I believe you should do the same (==
> > get rid of idle class for now, and clearly state that realtime ones
> > are not _guaranteed_ anything).
> 
> That's not doing something about it, that's giving up... 

:-) Yes. That's what we do for the scheduler already. [And it's better to
give up than to have a DoS security hole, right?]

> You could allow idle prio to proceed, if it holds a resource that could
> potentially block others.

I guess you can't push this for 2.6. And notice that we use the same
solution for the cpu scheduler, where the solution is quite easy (with no
hot-path overhead).
							Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: [PATCH] cfq + io priorities
  2003-11-18 13:26       ` Pavel Machek
@ 2003-11-18 13:32         ` Jens Axboe
  2003-11-18 13:38           ` Pavel Machek
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2003-11-18 13:32 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Guillaume Chazarain, Linux Kernel

On Tue, Nov 18 2003, Pavel Machek wrote:
> > > If semaphore is held over disk io somewhere (quota code? journaling?)
> > > you have ugly possibility of priority inversion there.
> > 
> > Indeed yes. That's a general problem with all the io priorities though,
> > RT io might end up waiting for nice 10 io etc. Dunno what to do about
> > this yet...
> 
> Well, traditional (== scheduler) solution is not to have idle classes
> and not guarantee anything about realtime classes.
> 
> At least idle class can not be used to hold important semaphore
> forever (even low-priority processes receive enough time not to hold
> important semaphores too long)... I believe you should do the same (==
> get rid of idle class for now, and clearly state that realtime ones
> are not _guaranteed_ anything).

That's not doing something about it, that's giving up... I'm not giving
any guarantees in the first place, just 'best effort'.

You could allow idle prio to proceed, if it holds a resource that could
potentially block others.

-- 
Jens Axboe



* Re: [PATCH] cfq + io priorities
  2003-11-17  8:14     ` Jens Axboe
@ 2003-11-18 13:26       ` Pavel Machek
  2003-11-18 13:32         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Machek @ 2003-11-18 13:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Guillaume Chazarain, Linux Kernel

Hi!

> > OTOH it might make sense to make "nice" command set
> > both by default.
> 
> Yes, I can probably be talked into that.

Good.

> > > > > these end values are "special" - 0 means the process is only allowed to
> > > > > do io if the disk is idle, and 20 means the process io is considered
> > > > 
> > > > So a process with ioprio == 0 can be forever starved. As it's not
> > > 
> > > Yes
> > 
> > If semaphore is held over disk io somewhere (quota code? journaling?)
> > you have ugly possibility of priority inversion there.
> 
> Indeed yes. That's a general problem with all the io priorities though,
> RT io might end up waiting for nice 10 io etc. Dunno what to do about
> this yet...

Well, the traditional (== scheduler) solution is not to have idle classes
and not to guarantee anything about realtime classes.

At least the idle class cannot be used to hold an important semaphore
forever (even low-priority processes receive enough time not to hold
important semaphores too long)... I believe you should do the same (==
get rid of the idle class for now, and clearly state that realtime ones
are not _guaranteed_ anything).
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: [PATCH] cfq + io priorities
  2003-11-13 12:54   ` Pavel Machek
  2003-11-16 22:56     ` Jakob Oestergaard
@ 2003-11-17  8:14     ` Jens Axboe
  2003-11-18 13:26       ` Pavel Machek
  1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2003-11-17  8:14 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Guillaume Chazarain, Linux Kernel

On Thu, Nov 13 2003, Pavel Machek wrote:
> Hi!
> 
> > > OK, I ask THE question : why not using the normal nice level, via
> > > current->static_prio ?
> > > This way, cdrecord would be RT even in IO, and nice -19 updatedb would have
> > > a minimal impact on the system.
> > 
> > I don't want to tie io priorities to cpu priorities, that's a design
> > decision.
> 
> OTOH it might make sense to make "nice" command set
> both by default.

Yes, I can probably be talked into that.

> > > > these end values are "special" - 0 means the process is only allowed to
> > > > do io if the disk is idle, and 20 means the process io is considered
> > > 
> > > So a process with ioprio == 0 can be forever starved. As it's not
> > 
> > Yes
> 
> If semaphore is held over disk io somewhere (quota code? journaling?)
> you have ugly possibility of priority inversion there.

Indeed yes. That's a general problem with all the io priorities though,
RT io might end up waiting for nice 10 io etc. Dunno what to do about
this yet...

-- 
Jens Axboe



* Re: [PATCH] cfq + io priorities
  2003-11-13 12:54   ` Pavel Machek
@ 2003-11-16 22:56     ` Jakob Oestergaard
  2003-11-17  8:14     ` Jens Axboe
  1 sibling, 0 replies; 23+ messages in thread
From: Jakob Oestergaard @ 2003-11-16 22:56 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Jens Axboe, Guillaume Chazarain, Linux Kernel

On Thu, Nov 13, 2003 at 01:54:28PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > OK, I ask THE question : why not using the normal nice level, via
> > > current->static_prio ?
> > > This way, cdrecord would be RT even in IO, and nice -19 updatedb would have
> > > a minimal impact on the system.
> > 
> > I don't want to tie io priorities to cpu priorities, that's a design
> > decision.
> 
> OTOH it might make sense to make "nice" command set
> both by default.

The syscall actually.

Users and developers alike expect "nice" to mean "nice".

Having cpu_nice and io_nice too would be nice for completeness, if for
some unfathomable reason someone would want to set the one and not the
other.


All in my humble opinion, of course  :)

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: [PATCH] cfq + io priorities
  2003-11-09 11:39 ` Jens Axboe
@ 2003-11-13 12:54   ` Pavel Machek
  2003-11-16 22:56     ` Jakob Oestergaard
  2003-11-17  8:14     ` Jens Axboe
  0 siblings, 2 replies; 23+ messages in thread
From: Pavel Machek @ 2003-11-13 12:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Guillaume Chazarain, Linux Kernel

Hi!

> > OK, I ask THE question : why not using the normal nice level, via
> > current->static_prio ?
> > This way, cdrecord would be RT even in IO, and nice -19 updatedb would have
> > a minimal impact on the system.
> 
> I don't want to tie io priorities to cpu priorities, that's a design
> decision.

OTOH it might make sense to make the "nice" command set
both by default.

> > > these end values are "special" - 0 means the process is only allowed to
> > > do io if the disk is idle, and 20 means the process io is considered
> > 
> > So a process with ioprio == 0 can be forever starved. As it's not
> 
> Yes

If a semaphore is held over disk io somewhere (quota code? journaling?)
you have an ugly possibility of priority inversion there.

> > Thanks for making something I have been dreaming of for a long time :)
> 
> Me too :)

Yep, another thanx from me...
-- 
				Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...



* Re: [PATCH] cfq + io priorities
  2003-11-09 21:30 ` Shailabh Nagar
@ 2003-11-09 21:34   ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2003-11-09 21:34 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: Linux Kernel, ckrm-tech

On Sun, Nov 09 2003, Shailabh Nagar wrote:
> >A process has an assigned io nice level, anywhere from 0 to 20. Both of
> >these end values are "special" - 0 means the process is only allowed to
> >do io if the disk is idle, and 20 means the process io is considered
> >realtime. Realtime IO always gets first access to the disk. 
> >
> 
> >Values from 1 to 19 assign 5-95% of disk bandwidth to that process. Any io 
> >class is
> >allowed to use all of disk bandwidth in absence of higher priority io.
> > 
> >
> Currently, cfq is doing bandwidth allocation in terms of  number of 
> requests, not bytes. Hence priority inversion can happen if lower 
> priority levels submit larger requests on an average. Any plans to take 
> request sizes into consideration  in future ? 

Yes that needs to be taken into account as well. I'll get another
version out soonish that works around that too.

> Of course, request sizes alone don't determine actual disk bandwidth 
> consumed  since their seek position also matters.

Yeah, that's where it gets tricky. It's basically impossible to get
absolutely right; it will always be just guidelines. I don't want to
overcomplicate matters.

> >About the patch: stuff like this really needs some resource management
> >abstraction like CKRM. Right now we just look at the tgid of the
> >process. 
> >
> Now thats music to our ears :-)  Though you've complicated matters by 
> calling the priority level a "class" ! Please consider renaming
> class  to something else  (say priolevel ).

Done.

> Thanks for separating the hashvalue as a macro. It should make it even 
> easier to convert cfq to use a  CKRM I/O classes ' priority
> rather than the submitting task's ioprio value.

Yup that was my intention, to make the transition as smooth as possible.

-- 
Jens Axboe



* Re: [PATCH] cfq + io priorities
  2003-11-08 12:47 Jens Axboe
  2003-11-08 13:25 ` Jens Axboe
@ 2003-11-09 21:30 ` Shailabh Nagar
  2003-11-09 21:34   ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Shailabh Nagar @ 2003-11-09 21:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel, ckrm-tech





Jens Axboe wrote:

>I've implemented IO nice levels in the CFQ io scheduler. 
>
Thanks for putting this in! It'll be very useful to have some control
over I/O priorities.


>It works as
>follows.
>
>A process has an assigned io nice level, anywhere from 0 to 20. Both of
>these end values are "special" - 0 means the process is only allowed to
>do io if the disk is idle, and 20 means the process io is considered
>realtime. Realtime IO always gets first access to the disk. 
>

>Values from 1 to 19 assign 5-95% of disk bandwidth to that process. Any io class is
>allowed to use all of disk bandwidth in absence of higher priority io.
>  
>
Currently, cfq is doing bandwidth allocation in terms of number of
requests, not bytes. Hence priority inversion can happen if lower
priority levels submit larger requests on average. Any plans to take
request sizes into consideration in the future?

Of course, request sizes alone don't determine actual disk bandwidth
consumed, since their seek position also matters.
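
(To put rough numbers on the request-size point: with cfq_quantum at its
default of 4, a prio 10 process submitting 4 x 512 KiB requests per turn
moves 2 MiB, while a prio 18 process submitting 4 x 4 KiB requests moves
only 16 KiB, so the nominally higher-priority io can end up with a small
fraction of the bandwidth. The sizes here are made up, purely to show
the effect.)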

>About the patch: stuff like this really needs some resource management
>abstraction like CKRM. Right now we just look at the tgid of the
>process. 
>
Now that's music to our ears :-)  Though you've complicated matters by
calling the priority level a "class"! Please consider renaming
class to something else (say priolevel).

Thanks for separating the hashvalue as a macro. It should make it even
easier to convert cfq to use a CKRM I/O class's priority
rather than the submitting task's ioprio value.

-- Shailabh



* Re: [PATCH] cfq + io priorities
  2003-11-09 10:57 Guillaume Chazarain
@ 2003-11-09 11:39 ` Jens Axboe
  2003-11-13 12:54   ` Pavel Machek
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2003-11-09 11:39 UTC (permalink / raw)
  To: Guillaume Chazarain; +Cc: Linux Kernel

On Sun, Nov 09 2003, Guillaume Chazarain wrote:
> > A process has an assigned io nice level, anywhere from 0 to 20. Both of
> 
> OK, I ask THE question : why not using the normal nice level, via
> current->static_prio ?
> This way, cdrecord would be RT even in IO, and nice -19 updatedb would have
> a minimal impact on the system.

I don't want to tie io priorities to cpu priorities; that's a design
decision.

> > these end values are "special" - 0 means the process is only allowed to
> > do io if the disk is idle, and 20 means the process io is considered
> 
> So a process with ioprio == 0 can be forever starved. As it's not

Yes

> done this way for nice -19 tasks (unlike FreeBSD), wouldn't it be
> safer to give a very long deadline to ioprio == 0 requests ?

ioprio == 0 means idle IO. It follows from that that you can risk
infinite starvation if other io is happening. Otherwise it would not be
idle io :-)

CFQ doesn't assign request deadlines. That would be another way of
handling starvation.

> Thanks for making something I have been dreaming of for a long time :)

Me too :)

-- 
Jens Axboe



* Re: [PATCH] cfq + io priorities
@ 2003-11-09 10:57 Guillaume Chazarain
  2003-11-09 11:39 ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Guillaume Chazarain @ 2003-11-09 10:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

> A process has an assigned io nice level, anywhere from 0 to 20. Both of

OK, I ask THE question: why not use the normal nice level, via
current->static_prio?
This way, cdrecord would be RT even in IO, and nice -19 updatedb would have
a minimal impact on the system.

> these end values are "special" - 0 means the process is only allowed to
> do io if the disk is idle, and 20 means the process io is considered

So a process with ioprio == 0 can be forever starved. As it's not
done this way for nice -19 tasks (unlike FreeBSD), wouldn't it be
safer to give a very long deadline to ioprio == 0 requests?


Thanks for making something I have been dreaming of for a long time :)


Guillaume






* Re: [PATCH] cfq + io priorities
  2003-11-08 13:25 ` Jens Axboe
@ 2003-11-08 14:06   ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2003-11-08 14:06 UTC (permalink / raw)
  To: Linux Kernel

Hi,

And also a new cfq, of course a few embarrassing bugs were introduced
right before diffing...

I also forgot to include a small intro to the tunables that cfq-prio
has, so I'll do that here too. You can find these in
/sys/block/<disk>/queue/iosched.

quantum
	A batch quantum, an io class is only allowed to submit as much
	as this number of io requests per turn. Default is 4.

idle_quantum
	Like quantum, but for idle io. Default is 1.

queued
	A process is able to submit at least this number of io requests,
	regardless. Default is 4.

grace_rt
	The realtime grace period, in jiffies (I'll add user -> kernel
	conversions later, for now you need to know the clock rate of
	your arch). Disk is kept idle this long after realtime io
	completes, to allow more realtime io. Default is 10 ms.

grace_idle
	Disk must have been idle for this long before idle io is allowed
	to be sent to disk. Default is 100ms.
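
As a quick illustration (device name and values made up): reading
/sys/block/hda/queue/iosched/quantum should show the default of 4, and
writing 8 to that same file doubles the per-turn batch for that disk.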

Additionally, should any "normal" io be pending when realtime io is
submitted, it's pulled off the dispatch list and put back into the io
scheduler. Should any idle io be on the dispatch list when normal or rt
io is submitted, that too is pulled back into the scheduler.

===== arch/i386/kernel/entry.S 1.69 vs edited =====
--- 1.69/arch/i386/kernel/entry.S	Wed Oct  1 15:53:17 2003
+++ edited/arch/i386/kernel/entry.S	Sat Nov  8 11:53:19 2003
@@ -880,5 +880,7 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+	.long sys_ioprio_set
+	.long sys_ioprio_get
 
 nr_syscalls=(.-sys_call_table)/4
===== arch/ppc/kernel/misc.S 1.49 vs edited =====
--- 1.49/arch/ppc/kernel/misc.S	Fri Sep 12 18:26:52 2003
+++ edited/arch/ppc/kernel/misc.S	Sat Nov  8 11:53:19 2003
@@ -1385,3 +1385,5 @@
 	.long sys_statfs64
 	.long sys_fstatfs64
 	.long ppc_fadvise64_64
+	.long sys_ioprio_set
+	.long sys_ioprio_get
===== drivers/block/Kconfig.iosched 1.2 vs edited =====
--- 1.2/drivers/block/Kconfig.iosched	Fri Aug 15 03:16:57 2003
+++ edited/drivers/block/Kconfig.iosched	Sat Nov  8 11:55:04 2003
@@ -27,3 +27,11 @@
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_CFQ
+	bool "CFQ I/O scheduler" if EMBEDDED
+	default y
+	---help---
+	  The CFQ I/O scheduler tries to distribute bandwidth equally
+	  among all processes in the system, with the option of assigning
+	  io priorities to a process or process group. It should provide a fair
+	  working environment, suitable for desktop systems.
===== drivers/block/Makefile 1.22 vs edited =====
--- 1.22/drivers/block/Makefile	Thu Oct 16 06:38:46 2003
+++ edited/drivers/block/Makefile	Sat Nov  8 11:53:19 2003
@@ -18,6 +18,7 @@
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_AS)	+= as-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
+obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MAC_FLOPPY)	+= swim3.o
 obj-$(CONFIG_BLK_DEV_FD)	+= floppy.o
 obj-$(CONFIG_BLK_DEV_FD98)	+= floppy98.o
===== drivers/block/elevator.c 1.52 vs edited =====
--- 1.52/drivers/block/elevator.c	Fri Sep  5 12:13:05 2003
+++ edited/drivers/block/elevator.c	Sat Nov  8 11:53:19 2003
@@ -302,6 +302,14 @@
 		e->elevator_put_req_fn(q, rq);
 }
 
+void elv_set_congested(request_queue_t *q)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_set_congested_fn)
+		e->elevator_set_congested_fn(q);
+}
+
 int elv_may_queue(request_queue_t *q, int rw)
 {
 	elevator_t *e = &q->elevator;
@@ -309,7 +317,7 @@
 	if (e->elevator_may_queue_fn)
 		return e->elevator_may_queue_fn(q, rw);
 
-	return 0;
+	return 1;
 }
 
 void elv_completed_request(request_queue_t *q, struct request *rq)
===== drivers/block/ll_rw_blk.c 1.221 vs edited =====
--- 1.221/drivers/block/ll_rw_blk.c	Thu Nov  6 00:11:47 2003
+++ edited/drivers/block/ll_rw_blk.c	Sat Nov  8 14:49:51 2003
@@ -1331,7 +1331,9 @@
 static int __make_request(request_queue_t *, struct bio *);
 
 static elevator_t *chosen_elevator =
-#if defined(CONFIG_IOSCHED_AS)
+#if defined(CONFIG_IOSCHED_CFQ)
+	&iosched_cfq;
+#elif defined(CONFIG_IOSCHED_AS)
 	&iosched_as;
 #elif defined(CONFIG_IOSCHED_DEADLINE)
 	&iosched_deadline;
@@ -1353,6 +1355,10 @@
 	if (!strcmp(str, "as"))
 		chosen_elevator = &iosched_as;
 #endif
+#ifdef CONFIG_IOSCHED_CFQ
+	if (!strcmp(str, "cfq"))
+		chosen_elevator = &iosched_cfq;
+#endif
 #ifdef CONFIG_IOSCHED_NOOP
 	if (!strcmp(str, "noop"))
 		chosen_elevator = &elevator_noop;
@@ -1553,6 +1559,10 @@
 	struct io_context *ioc = get_io_context(gfp_mask);
 
 	spin_lock_irq(q->queue_lock);
+
+	if (!elv_may_queue(q, rw))
+		goto out_lock;
+
 	if (rl->count[rw]+1 >= q->nr_requests) {
 		/*
 		 * The queue will fill after this allocation, so set it as
@@ -1566,15 +1576,12 @@
 		}
 	}
 
-	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
-		/*
-		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
-		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
-	}
+	/*
+	 * The queue is full and the allocating process is not a
+	 * "batcher", and not exempted by the IO scheduler
+	 */
+	if (blk_queue_full(q, rw) && !ioc_batching(ioc))
+		goto out_lock;
 
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
@@ -1592,8 +1599,7 @@
 		 */
 		spin_lock_irq(q->queue_lock);
 		freed_request(q, rw);
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_lock;
 	}
 
 	if (ioc_batching(ioc))
@@ -1622,6 +1628,11 @@
 out:
 	put_io_context(ioc);
 	return rq;
+out_lock:
+	if (!rq)
+		elv_set_congested(q);
+	spin_unlock_irq(q->queue_lock);
+	goto out;
 }
 
 /*
@@ -2936,3 +2947,21 @@
 		kobject_put(&disk->kobj);
 	}
 }
+
+asmlinkage int sys_ioprio_set(int ioprio)
+{
+	if (ioprio < IOPRIO_IDLE || ioprio > IOPRIO_RT)
+		return -EINVAL;
+	if (ioprio == IOPRIO_RT && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	printk("%s: set ioprio %d\n", current->comm, ioprio);
+	current->ioprio = ioprio;
+	return 0;
+}
+
+asmlinkage int sys_ioprio_get(void)
+{
+	return current->ioprio;
+}
+
===== include/asm-i386/unistd.h 1.30 vs edited =====
--- 1.30/include/asm-i386/unistd.h	Thu Oct  2 09:12:21 2003
+++ edited/include/asm-i386/unistd.h	Sat Nov  8 11:53:20 2003
@@ -279,8 +279,10 @@
 #define __NR_utimes		271
 #define __NR_fadvise64_64	272
 #define __NR_vserver		273
+#define __NR_ioprio_set		274
+#define __NR_ioprio_get		275
 
-#define NR_syscalls 274
+#define NR_syscalls 276
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
===== include/asm-ppc/unistd.h 1.26 vs edited =====
--- 1.26/include/asm-ppc/unistd.h	Sat Aug 23 04:15:18 2003
+++ edited/include/asm-ppc/unistd.h	Sat Nov  8 11:53:20 2003
@@ -259,8 +259,10 @@
 #define __NR_statfs64		252
 #define __NR_fstatfs64		253
 #define __NR_fadvise64_64	254
+#define __NR_ioprio_set		255
+#define __NR_ioprio_get		256
 
-#define __NR_syscalls		255
+#define __NR_syscalls		257
 
 #define __NR(n)	#n
 
===== include/asm-x86_64/unistd.h 1.18 vs edited =====
--- 1.18/include/asm-x86_64/unistd.h	Sun Oct  5 18:35:37 2003
+++ edited/include/asm-x86_64/unistd.h	Sat Nov  8 11:53:20 2003
@@ -532,8 +532,12 @@
 __SYSCALL(__NR_utimes, sys_utimes)
 #define __NR_vserver		236
 __SYSCALL(__NR_vserver, sys_ni_syscall)
+#define __NR_ioprio_set		237
+__SYSCALL(__NR_ioprio_set, sys_ioprio_set);
+#define __NR_ioprio_get		238
+__SYSCALL(__NR_ioprio_get, sys_ioprio_get);
 
-#define __NR_syscall_max __NR_vserver
+#define __NR_syscall_max __NR_ioprio_get
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */
===== include/linux/elevator.h 1.29 vs edited =====
--- 1.29/include/linux/elevator.h	Sun Sep 21 23:50:12 2003
+++ edited/include/linux/elevator.h	Sat Nov  8 11:53:20 2003
@@ -17,6 +17,7 @@
 typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
 typedef void (elevator_completed_req_fn) (request_queue_t *, struct request *);
 typedef int (elevator_may_queue_fn) (request_queue_t *, int);
+typedef void (elevator_set_congested_fn) (request_queue_t *);
 
 typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
 typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
@@ -45,6 +46,7 @@
 	elevator_put_req_fn *elevator_put_req_fn;
 
 	elevator_may_queue_fn *elevator_may_queue_fn;
+	elevator_set_congested_fn *elevator_set_congested_fn;
 
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
@@ -74,6 +76,7 @@
 extern int elv_register_queue(request_queue_t *q);
 extern void elv_unregister_queue(request_queue_t *q);
 extern int elv_may_queue(request_queue_t *, int);
+extern void elv_set_congested(request_queue_t *);
 extern void elv_completed_request(request_queue_t *, struct request *);
 extern int elv_set_request(request_queue_t *, struct request *, int);
 extern void elv_put_request(request_queue_t *, struct request *);
@@ -93,6 +96,11 @@
  * anticipatory I/O scheduler
  */
 extern elevator_t iosched_as;
+
+/*
+ * completely fair queueing I/O scheduler
+ */
+extern elevator_t iosched_cfq;
 
 extern int elevator_init(request_queue_t *, elevator_t *);
 extern void elevator_exit(request_queue_t *);
===== include/linux/fs.h 1.274 vs edited =====
--- 1.274/include/linux/fs.h	Tue Sep 23 06:16:30 2003
+++ edited/include/linux/fs.h	Sat Nov  8 11:53:20 2003
@@ -1408,5 +1408,16 @@
 	return res;
 }
 
+/* io priorities */
+
+#define IOPRIO_NR      21
+
+#define IOPRIO_IDLE	0
+#define IOPRIO_NORM	10
+#define IOPRIO_RT	20
+
+asmlinkage int sys_ioprio_set(int ioprio);
+asmlinkage int sys_ioprio_get(void);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_FS_H */
===== include/linux/init_task.h 1.27 vs edited =====
--- 1.27/include/linux/init_task.h	Tue Aug 19 04:46:23 2003
+++ edited/include/linux/init_task.h	Sat Nov  8 11:53:20 2003
@@ -108,6 +108,7 @@
 	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.ioprio		= IOPRIO_NORM,					\
 }
 
 
===== include/linux/sched.h 1.174 vs edited =====
--- 1.174/include/linux/sched.h	Mon Oct 27 04:16:19 2003
+++ edited/include/linux/sched.h	Sat Nov  8 11:53:20 2003
@@ -462,6 +462,8 @@
 
 	struct io_context *io_context;
 
+	int ioprio;
+
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
 };
===== kernel/fork.c 1.145 vs edited =====
--- 1.145/kernel/fork.c	Fri Oct 10 00:13:54 2003
+++ edited/kernel/fork.c	Sat Nov  8 11:53:20 2003
@@ -1046,6 +1046,7 @@
 	} else
 		link_pid(p, p->pids + PIDTYPE_TGID, &p->group_leader->pids[PIDTYPE_TGID].pid);
 
+	p->ioprio = current->ioprio;
 	nr_threads++;
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
--- /dev/null	2003-09-24 00:19:32.000000000 +0200
+++ linux-2.6-cfq/drivers/block/cfq-iosched.c	2003-11-08 14:55:41.000000000 +0100
@@ -0,0 +1,1169 @@
+/*
+ *  linux/drivers/block/cfq-iosched.c
+ *
+ *  CFQ, or complete fairness queueing, disk scheduler.
+ *
+ *  Based on ideas from a previously unfinished io
+ *  scheduler (round robin per-process disk scheduling) and Andrea Arcangeli.
+ *
+ *  IO priorities are supported, from 0% to 100% in 5% increments. Both of
+ *  those values have special meaning - the 0% class is allowed to do io if
+ *  no one else wants to use the disk. 100% is considered real-time io, and
+ *  always gets priority. Default process io rate is 95%. In absence of other
+ *  io, a class may consume 100% disk bandwidth regardless. Within a class,
+ *  bandwidth is distributed equally among the citizens.
+ *
+ * TODO:
+ *	- cfq_select_requests() needs some work for 5-95% io
+ *	- barriers not supported
+ *
+ *  Copyright (C) 2003 Jens Axboe <axboe@suse.de>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/hash.h>
+#include <linux/rbtree.h>
+#include <linux/mempool.h>
+
+#if IOPRIO_NR > BITS_PER_LONG
+#error Cannot support this many io priority levels
+#endif
+
+/*
+ * tunables
+ */
+static int cfq_quantum = 4;
+static int cfq_idle_quantum = 1;
+static int cfq_queued = 4;
+static int cfq_grace_rt = HZ / 100 ?: 1;
+static int cfq_grace_idle = HZ / 10;
+
+#define CFQ_QHASH_SHIFT		6
+#define CFQ_QHASH_ENTRIES	(1 << CFQ_QHASH_SHIFT)
+#define list_entry_qhash(entry)	list_entry((entry), struct cfq_queue, cfq_hash)
+
+#define CFQ_MHASH_SHIFT		8
+#define CFQ_MHASH_BLOCK(sec)	((sec) >> 3)
+#define CFQ_MHASH_ENTRIES	(1 << CFQ_MHASH_SHIFT)
+#define CFQ_MHASH_FN(sec)	(hash_long(CFQ_MHASH_BLOCK((sec)),CFQ_MHASH_SHIFT))
+#define ON_MHASH(crq)		!list_empty(&(crq)->hash)
+#define rq_hash_key(rq)		((rq)->sector + (rq)->nr_sectors)
+#define list_entry_hash(ptr)	list_entry((ptr), struct cfq_rq, hash)
+
+#define list_entry_cfqq(ptr)	list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_prio(ptr)	list_entry((ptr), struct cfq_rq, prio_list)
+
+#define RQ_DATA(rq)		((struct cfq_rq *) (rq)->elevator_private)
+
+#define cfq_account_io(crq)	\
+	((crq)->ioprio != IOPRIO_IDLE && (crq)->ioprio != IOPRIO_RT)
+
+/*
+ * defines how we distribute bandwidth (can be tgid, uid, etc)
+ */
+#define cfq_hash_key(current)	((current)->tgid)
+
+/*
+ * move to io_context
+ */
+#define cfq_ioprio(current)	((current)->ioprio)
+
+#define CFQ_WAIT_RT	0
+#define CFQ_WAIT_NORM	1
+
+static kmem_cache_t *crq_pool;
+static kmem_cache_t *cfq_pool;
+static mempool_t *cfq_mpool;
+
+/*
+ * defines an io class
+ */
+struct class_io_data {
+	struct list_head rr_list;
+	int busy_queues;
+	int busy_rq;
+	struct list_head prio_list;
+	int last_queued;
+};
+
+/*
+ * per-request queue structure
+ */
+struct cfq_data {
+	struct list_head *dispatch;
+	struct list_head *cfq_hash;
+	struct list_head *crq_hash;
+	mempool_t *crq_pool;
+
+	struct class_io_data cid[IOPRIO_NR];
+
+	/*
+	 * total number of busy queues and requests
+	 */
+	int busy_rq;
+	int busy_queues;
+
+	unsigned long rq_starved_mask;
+
+	/*
+	 * grace period handling
+	 */
+	struct timer_list timer;
+	unsigned long wait_end;
+	unsigned long flags;
+	struct work_struct work;
+
+	/*
+	 * tunables
+	 */
+	unsigned int cfq_quantum;
+	unsigned int cfq_idle_quantum;
+	unsigned int cfq_queued;
+	unsigned int cfq_grace_rt;
+	unsigned int cfq_grace_idle;
+};
+
+/*
+ * per-class structure
+ */
+struct cfq_queue {
+	struct list_head cfq_list;
+	struct list_head cfq_hash;
+	int hash_key;
+	struct rb_root sort_list;
+	int queued[2];
+	int ioprio;
+};
+
+/*
+ * per-request structure
+ */
+struct cfq_rq {
+	struct cfq_queue *cfq_queue;
+	struct rb_node rb_node;
+	struct list_head hash;
+	sector_t rb_key;
+
+	struct request *request;
+
+	struct list_head prio_list;
+	int ioprio;
+};
+
+static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq);
+static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int pid);
+static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq);
+
+/*
+ * lots of deadline iosched dupes, can be abstracted later...
+ */
+static inline void __cfq_del_crq_hash(struct cfq_rq *crq)
+{
+	list_del_init(&crq->hash);
+}
+
+static inline void cfq_del_crq_hash(struct cfq_rq *crq)
+{
+	if (ON_MHASH(crq))
+		__cfq_del_crq_hash(crq);
+}
+
+static inline void
+cfq_remove_merge_hints(request_queue_t *q, struct cfq_rq *crq)
+{
+	cfq_del_crq_hash(crq);
+
+	if (q->last_merge == crq->request)
+		q->last_merge = NULL;
+}
+
+static inline void cfq_add_crq_hash(struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	struct request *rq = crq->request;
+
+	BUG_ON(ON_MHASH(crq));
+
+	list_add(&crq->hash, &cfqd->crq_hash[CFQ_MHASH_FN(rq_hash_key(rq))]);
+}
+
+static struct request *cfq_find_rq_hash(struct cfq_data *cfqd, sector_t offset)
+{
+	struct list_head *hash_list = &cfqd->crq_hash[CFQ_MHASH_FN(offset)];
+	struct list_head *entry, *next = hash_list->next;
+
+	while ((entry = next) != hash_list) {
+		struct cfq_rq *crq = list_entry_hash(entry);
+		struct request *__rq = crq->request;
+
+		next = entry->next;
+
+		BUG_ON(!ON_MHASH(crq));
+
+		if (!rq_mergeable(__rq)) {
+			__cfq_del_crq_hash(crq);
+			continue;
+		}
+
+		if (rq_hash_key(__rq) == offset)
+			return __rq;
+	}
+
+	return NULL;
+}
+
+/*
+ * rb tree support functions
+ */
+#define RB_NONE			(2)
+#define RB_EMPTY(node)		((node)->rb_node == NULL)
+#define RB_CLEAR(node)		((node)->rb_color = RB_NONE)
+#define RB_CLEAR_ROOT(root)	((root)->rb_node = NULL)
+#define ON_RB(node)		((node)->rb_color != RB_NONE)
+#define rb_entry_crq(node)	rb_entry((node), struct cfq_rq, rb_node)
+#define rq_rb_key(rq)		(rq)->sector
+
+static void
+cfq_del_crq_rb(struct cfq_data *cfqd, struct cfq_queue *cfqq,struct cfq_rq *crq)
+{
+	if (ON_RB(&crq->rb_node)) {
+		if (cfq_account_io(crq)) {
+			cfqd->busy_rq--;
+			cfqd->cid[crq->ioprio].busy_rq--;
+		}
+		cfqq->queued[rq_data_dir(crq->request)]--;
+		rb_erase(&crq->rb_node, &cfqq->sort_list);
+		RB_CLEAR(&crq->rb_node);
+		crq->cfq_queue = NULL;
+	}
+}
+
+static struct cfq_rq *
+__cfq_add_crq_rb(struct cfq_queue *cfqq, struct cfq_rq *crq)
+{
+	struct rb_node **p = &cfqq->sort_list.rb_node;
+	struct rb_node *parent = NULL;
+	struct cfq_rq *__crq;
+
+	while (*p) {
+		parent = *p;
+		__crq = rb_entry_crq(parent);
+
+		if (crq->rb_key < __crq->rb_key)
+			p = &(*p)->rb_left;
+		else if (crq->rb_key > __crq->rb_key)
+			p = &(*p)->rb_right;
+		else
+			return __crq;
+	}
+
+	rb_link_node(&crq->rb_node, parent, p);
+	return 0;
+}
+
+static void
+cfq_add_crq_rb(struct cfq_data *cfqd, struct cfq_queue *cfqq,struct cfq_rq *crq)
+{
+	struct request *rq = crq->request;
+	struct cfq_rq *__alias;
+
+	crq->rb_key = rq_rb_key(rq);
+	cfqq->queued[rq_data_dir(rq)]++;
+	if (cfq_account_io(crq)) {
+		cfqd->busy_rq++;
+		cfqd->cid[crq->ioprio].busy_rq++;
+	}
+retry:
+	__alias = __cfq_add_crq_rb(cfqq, crq);
+	if (!__alias) {
+		rb_insert_color(&crq->rb_node, &cfqq->sort_list);
+		crq->cfq_queue = cfqq;
+		return;
+	}
+
+	cfq_del_crq_rb(cfqd, cfqq, __alias);
+	cfq_dispatch_sort(cfqd->dispatch, __alias);
+	goto retry;
+}
+
+static struct request *
+cfq_find_rq_rb(struct cfq_data *cfqd, sector_t sector)
+{
+	struct cfq_queue *cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	struct rb_node *n;
+
+	if (!cfqq)
+		goto out;
+
+	n = cfqq->sort_list.rb_node;
+	while (n) {
+		struct cfq_rq *crq = rb_entry_crq(n);
+
+		if (sector < crq->rb_key)
+			n = n->rb_left;
+		else if (sector > crq->rb_key)
+			n = n->rb_right;
+		else
+			return crq->request;
+	}
+
+out:
+	return NULL;
+}
+
+static void cfq_remove_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	if (crq) {
+		cfq_remove_merge_hints(q, crq);
+		list_del_init(&crq->prio_list);
+		list_del_init(&rq->queuelist);
+
+		/*
+		 * set a grace period timer to allow realtime io to make real
+		 * progress, if we release an rt request. for normal request,
+		 * set timer so idle io doesn't interfere with other io
+		 */
+		if (crq->ioprio == IOPRIO_RT) {
+			set_bit(CFQ_WAIT_RT, &cfqd->flags);
+			cfqd->wait_end = jiffies + cfqd->cfq_grace_rt;
+		} else if (crq->ioprio != IOPRIO_IDLE) {
+			set_bit(CFQ_WAIT_NORM, &cfqd->flags);
+			cfqd->wait_end = jiffies + cfqd->cfq_grace_idle;
+		}
+
+		if (ON_RB(&crq->rb_node)) {
+			struct cfq_queue *cfqq = crq->cfq_queue;
+
+			cfq_del_crq_rb(cfqd, cfqq, crq);
+
+			if (RB_EMPTY(&cfqq->sort_list))
+				cfq_put_queue(cfqd, cfqq);
+		}
+	}
+}
+
+static int
+cfq_merge(request_queue_t *q, struct request **req, struct bio *bio)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct request *__rq;
+	int ret;
+
+	ret = elv_try_last_merge(q, bio);
+	if (ret != ELEVATOR_NO_MERGE) {
+		__rq = q->last_merge;
+		goto out_insert;
+	}
+
+	__rq = cfq_find_rq_hash(cfqd, bio->bi_sector);
+	if (__rq) {
+		BUG_ON(__rq->sector + __rq->nr_sectors != bio->bi_sector);
+
+		if (elv_rq_merge_ok(__rq, bio)) {
+			ret = ELEVATOR_BACK_MERGE;
+			goto out;
+		}
+	}
+
+	__rq = cfq_find_rq_rb(cfqd, bio->bi_sector + bio_sectors(bio));
+	if (__rq) {
+		if (elv_rq_merge_ok(__rq, bio)) {
+			ret = ELEVATOR_FRONT_MERGE;
+			goto out;
+		}
+	}
+
+	return ELEVATOR_NO_MERGE;
+out:
+	q->last_merge = __rq;
+out_insert:
+	*req = __rq;
+	return ret;
+}
+
+static void cfq_merged_request(request_queue_t *q, struct request *req)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(req);
+
+	cfq_del_crq_hash(crq);
+	cfq_add_crq_hash(cfqd, crq);
+
+	if (ON_RB(&crq->rb_node) && (rq_rb_key(req) != crq->rb_key)) {
+		struct cfq_queue *cfqq = crq->cfq_queue;
+
+		cfq_del_crq_rb(cfqd, cfqq, crq);
+		cfq_add_crq_rb(cfqd, cfqq, crq);
+	}
+
+	q->last_merge = req;
+}
+
+static void
+cfq_merged_requests(request_queue_t *q, struct request *req,
+		    struct request *next)
+{
+	cfq_merged_request(q, req);
+	cfq_remove_request(q, next);
+}
+
+/*
+ * sort into dispatch list, in optimal ascending order
+ */
+static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq)
+{
+	struct list_head *entry = head;
+	struct request *__rq;
+
+	if (!list_empty(head)) {
+		__rq = list_entry_rq(head->next);
+
+		if (crq->request->sector < __rq->sector) {
+			entry = head->prev;
+			goto link;
+		}
+	}
+
+	while ((entry = entry->prev) != head) {
+		__rq = list_entry_rq(entry);
+
+		if (crq->request->sector <= __rq->sector)
+			break;
+	}
+
+link:
+	list_add_tail(&crq->request->queuelist, entry);
+}
+
+/*
+ * remove from io scheduler core and put on dispatch list for service
+ */
+static inline void
+__cfq_dispatch_requests(request_queue_t *q, struct cfq_data *cfqd,
+			struct cfq_queue *cfqq)
+{
+	struct cfq_rq *crq;
+
+	crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+
+	cfq_del_crq_rb(cfqd, cfqq, crq);
+	cfq_remove_merge_hints(q, crq);
+	cfq_dispatch_sort(cfqd->dispatch, crq);
+
+	/*
+	 * technically, for IOPRIO_RT we don't need to add it to the list.
+	 */
+	list_add_tail(&crq->prio_list, &cfqd->cid[cfqq->ioprio].prio_list);
+}
+
+static int
+cfq_dispatch_requests(request_queue_t *q, int prio, int may_queue)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct list_head *plist = &cfqd->cid[prio].rr_list;
+	struct list_head *entry, *nxt;
+	int queued = 0;
+
+	/*
+	 * for each queue at this prio level, dispatch a request
+	 */
+	list_for_each_safe(entry, nxt, plist) {
+		struct cfq_queue *cfqq = list_entry_cfqq(entry);
+
+		BUG_ON(RB_EMPTY(&cfqq->sort_list));
+
+		__cfq_dispatch_requests(q, cfqd, cfqq);
+		queued++;
+
+		if (RB_EMPTY(&cfqq->sort_list))
+			cfq_put_queue(cfqd, cfqq);
+
+		/*
+		 * if we hit the queue limit, put the string of serviced
+		 * queues at the back of the pending list
+		 */
+		if (queued == may_queue) {
+			struct list_head *prv = nxt->prev;
+
+			if (prv != plist) {
+				list_del(plist);
+				list_add(plist, prv);
+			}
+			break;
+		}
+	}
+
+	return queued;
+}
+
+/*
+ * try to move some requests to the dispatch list. return 0 on success
+ */
+static int cfq_select_requests(request_queue_t *q, struct cfq_data *cfqd)
+{
+	int queued, ios, i;
+
+	/*
+	 * if there's any realtime io, only schedule that
+	 */
+	if (cfq_dispatch_requests(q, IOPRIO_RT, cfqd->cfq_quantum))
+		return 1;
+
+	/*
+	 * if RT io was last serviced and grace time hasn't expired,
+	 * arm the timer to restart queueing if no other RT io has been
+	 * submitted in the mean time
+	 */
+	if (test_bit(CFQ_WAIT_RT, &cfqd->flags)) {
+		if (time_before(jiffies, cfqd->wait_end)) {
+			mod_timer(&cfqd->timer, cfqd->wait_end);
+			return 0;
+		}
+		clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+	}
+
+	/*
+	 * for each priority class, calculate number of requests we
+	 * are allowed to put into service.
+	 */
+	queued = 0;
+	ios = cfqd->busy_rq;
+	for (i = IOPRIO_RT - 1; i > IOPRIO_IDLE; i--) {
+		const int other_io = ios - cfqd->cid[i].busy_rq;
+		int may_queue = cfqd->cfq_quantum;
+		int dispatched;
+
+		/*
+		 * no need to keep iterating the list, if there are no
+		 * requests pending anymore
+		 */
+		if (!cfqd->busy_rq)
+			break;
+
+		if (other_io)
+			may_queue = other_io * (i + 1) / IOPRIO_NR;
+		if (may_queue > cfqd->cfq_quantum)
+			may_queue = cfqd->cfq_quantum;
+		/*
+		 * average with last dispatched for fairness
+		 */
+		if (cfqd->cid[i].last_queued != -1)
+			may_queue = (cfqd->cid[i].last_queued + may_queue) / 2;
+
+		dispatched = cfq_dispatch_requests(q, i, may_queue);
+
+		cfqd->cid[i].last_queued = dispatched;
+		queued += dispatched;
+	}
+
+	if (queued)
+		return 1;
+
+	/*
+	 * only allow dispatch of idle io, if the queue has been idle from
+	 * servicing RT or normal io for the grace period
+	 */
+	if (test_bit(CFQ_WAIT_NORM, &cfqd->flags)) {
+		if (time_before(jiffies, cfqd->wait_end)) {
+			mod_timer(&cfqd->timer, cfqd->wait_end);
+			return 0;
+		}
+		clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+	}
+
+	/*
+	 * if we found nothing to do, allow idle io to be serviced
+	 */
+	if (cfq_dispatch_requests(q, IOPRIO_IDLE, cfqd->cfq_idle_quantum))
+		return 1;
+
+	return 0;
+}
+
+static struct request *cfq_next_request(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct request *rq;
+
+	if (!list_empty(cfqd->dispatch)) {
+		struct cfq_rq *crq;
+dispatch:
+		/*
+		 * end grace period, we are servicing a request
+		 */
+		del_timer(&cfqd->timer);
+		clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+		clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+
+		BUG_ON(list_empty(cfqd->dispatch));
+		rq = list_entry_rq(cfqd->dispatch->next);
+
+		BUG_ON(q->last_merge == rq);
+		crq = RQ_DATA(rq);
+		if (crq) {
+			BUG_ON(ON_MHASH(crq));
+			list_del_init(&crq->prio_list);
+		}
+
+		return rq;
+	}
+
+	/*
+	 * we moved requests to dispatch list, go back end serve one
+	 */
+	if (cfq_select_requests(q, cfqd))
+		goto dispatch;
+
+	return NULL;
+}
+
+static inline struct cfq_queue *
+__cfq_find_cfq_hash(struct cfq_data *cfqd, int hashkey, const int hashval)
+{
+	struct list_head *hash_list = &cfqd->cfq_hash[hashval];
+	struct list_head *entry;
+
+	list_for_each(entry, hash_list) {
+		struct cfq_queue *__cfqq = list_entry_qhash(entry);
+
+		if (__cfqq->hash_key == hashkey)
+			return __cfqq;
+	}
+
+	return NULL;
+}
+
+static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int hashkey)
+{
+	const int hashval = hash_long(hashkey, CFQ_QHASH_SHIFT);
+
+	return __cfq_find_cfq_hash(cfqd, hashkey, hashval);
+}
+
+static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	cfqd->busy_queues--;
+	WARN_ON(cfqd->busy_queues < 0);
+
+	cfqd->cid[cfqq->ioprio].busy_queues--;
+	WARN_ON(cfqd->cid[cfqq->ioprio].busy_queues < 0);
+
+	list_del(&cfqq->cfq_list);
+	list_del(&cfqq->cfq_hash);
+	mempool_free(cfqq, cfq_mpool);
+}
+
+static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, int hashkey)
+{
+	const int hashval = hash_long(hashkey, CFQ_QHASH_SHIFT);
+	struct cfq_queue *cfqq = __cfq_find_cfq_hash(cfqd, hashkey, hashval);
+
+	if (!cfqq) {
+		cfqq = mempool_alloc(cfq_mpool, GFP_NOIO);
+
+		INIT_LIST_HEAD(&cfqq->cfq_hash);
+		INIT_LIST_HEAD(&cfqq->cfq_list);
+		RB_CLEAR_ROOT(&cfqq->sort_list);
+
+		cfqq->hash_key = cfq_hash_key(current);
+		cfqq->queued[0] = cfqq->queued[1] = 0;
+		cfqq->ioprio = cfq_ioprio(current);
+		list_add(&cfqq->cfq_hash, &cfqd->cfq_hash[hashval]);
+	}
+
+	return cfqq;
+}
+
+static void
+__cfq_enqueue(request_queue_t *q, struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	const int prio = crq->ioprio;
+	struct cfq_queue *cfqq;
+
+	cfqq = cfq_get_queue(cfqd, cfq_hash_key(current));
+
+	/*
+	 * not too good...
+	 */
+	if (prio > cfqq->ioprio) {
+		if (!list_empty(&cfqq->cfq_list)) {
+			cfqd->cid[cfqq->ioprio].busy_queues--;
+			WARN_ON(cfqd->cid[cfqq->ioprio].busy_queues < 0);
+			cfqd->cid[prio].busy_queues++;
+			list_move_tail(&cfqq->cfq_list,&cfqd->cid[prio].rr_list);
+		}
+		cfqq->ioprio = prio;
+	}
+
+	cfq_add_crq_rb(cfqd, cfqq, crq);
+
+	if (list_empty(&cfqq->cfq_list)) {
+		list_add_tail(&cfqq->cfq_list, &cfqd->cid[prio].rr_list);
+		cfqd->cid[prio].busy_queues++;
+		cfqd->busy_queues++;
+	}
+
+	if (rq_mergeable(crq->request)) {
+		cfq_add_crq_hash(cfqd, crq);
+
+		if (!q->last_merge)
+			q->last_merge = crq->request;
+	}
+
+}
+
+static void cfq_reenqueue(request_queue_t *q, struct cfq_data *cfqd, int prio)
+{
+	struct list_head *prio_list = &cfqd->cid[prio].prio_list;
+	struct list_head *entry, *tmp;
+
+	list_for_each_safe(entry, tmp, prio_list) {
+		struct cfq_rq *crq = list_entry_prio(entry);
+
+		list_del_init(entry);
+		list_del_init(&crq->request->queuelist);
+		__cfq_enqueue(q, cfqd, crq);
+	}
+}
+
+static void
+cfq_enqueue(request_queue_t *q, struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	const int prio = cfq_ioprio(current);
+
+	crq->ioprio = prio;
+	__cfq_enqueue(q, cfqd, crq);
+
+	if (prio == IOPRIO_RT) {
+		int i;
+
+		/*
+		 * realtime io gets priority, move all other io back
+		 */
+		for (i = IOPRIO_IDLE; i < IOPRIO_RT; i++)
+			cfq_reenqueue(q, cfqd, i);
+	} else if (prio != IOPRIO_IDLE) {
+		/*
+		 * check if we need to move idle io back into queue
+		 */
+		cfq_reenqueue(q, cfqd, IOPRIO_IDLE);
+	}
+}
+
+static void
+cfq_insert_request(request_queue_t *q, struct request *rq, int where)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	switch (where) {
+		case ELEVATOR_INSERT_BACK:
+#if 0
+			while (cfq_dispatch_requests(q, cfqd))
+				;
+#endif
+			list_add_tail(&rq->queuelist, cfqd->dispatch);
+			break;
+		case ELEVATOR_INSERT_FRONT:
+			list_add(&rq->queuelist, cfqd->dispatch);
+			break;
+		case ELEVATOR_INSERT_SORT:
+			BUG_ON(!blk_fs_request(rq));
+			cfq_enqueue(q, cfqd, crq);
+			break;
+		default:
+			printk("%s: bad insert point %d\n", __FUNCTION__,where);
+			return;
+	}
+}
+
+static int cfq_queue_empty(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+
+	if (list_empty(cfqd->dispatch) && !cfqd->busy_queues)
+		return 1;
+
+	return 0;
+}
+
+static struct request *
+cfq_former_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_rq *crq = RQ_DATA(rq);
+	struct rb_node *rbprev = rb_prev(&crq->rb_node);
+
+	if (rbprev)
+		return rb_entry_crq(rbprev)->request;
+
+	return NULL;
+}
+
+static struct request *
+cfq_latter_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_rq *crq = RQ_DATA(rq);
+	struct rb_node *rbnext = rb_next(&crq->rb_node);
+
+	if (rbnext)
+		return rb_entry_crq(rbnext)->request;
+
+	return NULL;
+}
+
+static void cfq_queue_congested(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+
+	set_bit(cfq_ioprio(current), &cfqd->rq_starved_mask);
+}
+
+static int cfq_may_queue(request_queue_t *q, int rw)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_queue *cfqq;
+	const int prio = cfq_ioprio(current);
+	int limit, ret = 1;
+
+	if (!cfqd->busy_queues)
+		goto out;
+
+	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	if (!cfqq)
+		goto out;
+
+	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	if (!cfqq)
+		goto out;
+
+	/*
+	 * if higher or equal prio io is sleeping waiting for a request, don't
+	 * allow this one to allocate one. as long as ll_rw_blk does fifo
+	 * waitqueue wakeups this should work...
+	 */
+	if (cfqd->rq_starved_mask & ~((1 << prio) - 1))
+		goto out;
+
+	if (cfqq->queued[rw] < cfqd->cfq_queued || !cfqd->cid[prio].busy_queues)
+		goto out;
+
+	limit = q->nr_requests * (prio + 1) / IOPRIO_NR;
+	limit /= cfqd->cid[prio].busy_queues;
+	if (cfqq->queued[rw] > limit)
+		ret = 0;
+
+out:
+	return ret;
+}
+
+static void cfq_put_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	if (crq) {
+		BUG_ON(q->last_merge == rq);
+		BUG_ON(ON_MHASH(crq));
+
+		mempool_free(crq, cfqd->crq_pool);
+		rq->elevator_private = NULL;
+	}
+}
+
+static int cfq_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = mempool_alloc(cfqd->crq_pool, gfp_mask);
+
+	if (crq) {
+		/*
+		 * process now has one request
+		 */
+		clear_bit(cfq_ioprio(current), &cfqd->rq_starved_mask);
+
+		memset(crq, 0, sizeof(*crq));
+		RB_CLEAR(&crq->rb_node);
+		crq->request = rq;
+		INIT_LIST_HEAD(&crq->hash);
+		INIT_LIST_HEAD(&crq->prio_list);
+		rq->elevator_private = crq;
+		return 0;
+	}
+
+	return 1;
+}
+
+static void cfq_exit(request_queue_t *q, elevator_t *e)
+{
+	struct cfq_data *cfqd = e->elevator_data;
+
+	e->elevator_data = NULL;
+	mempool_destroy(cfqd->crq_pool);
+	kfree(cfqd->crq_hash);
+	kfree(cfqd->cfq_hash);
+	kfree(cfqd);
+}
+
+static void cfq_timer(unsigned long data)
+{
+	struct cfq_data *cfqd = (struct cfq_data *) data;
+
+	clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+	clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+	kblockd_schedule_work(&cfqd->work);
+}
+
+static void cfq_work(void *data)
+{
+	request_queue_t *q = data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (cfq_next_request(q))
+		q->request_fn(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static int cfq_init(request_queue_t *q, elevator_t *e)
+{
+	struct cfq_data *cfqd;
+	int i;
+
+	cfqd = kmalloc(sizeof(*cfqd), GFP_KERNEL);
+	if (!cfqd)
+		return -ENOMEM;
+
+	memset(cfqd, 0, sizeof(*cfqd));
+
+	init_timer(&cfqd->timer);
+	cfqd->timer.function = cfq_timer;
+	cfqd->timer.data = (unsigned long) cfqd;
+
+	INIT_WORK(&cfqd->work, cfq_work, q);
+
+	for (i = 0; i < IOPRIO_NR; i++) {
+		struct class_io_data *cid = &cfqd->cid[i];
+
+		INIT_LIST_HEAD(&cid->rr_list);
+		INIT_LIST_HEAD(&cid->prio_list);
+		cid->last_queued = -1;
+	}
+
+	cfqd->crq_hash = kmalloc(sizeof(struct list_head) * CFQ_MHASH_ENTRIES, GFP_KERNEL);
+	if (!cfqd->crq_hash)
+		goto out_crqhash;
+
+	cfqd->cfq_hash = kmalloc(sizeof(struct list_head) * CFQ_QHASH_ENTRIES, GFP_KERNEL);
+	if (!cfqd->cfq_hash)
+		goto out_cfqhash;
+
+	cfqd->crq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, crq_pool);
+	if (!cfqd->crq_pool)
+		goto out_crqpool;
+
+	for (i = 0; i < CFQ_MHASH_ENTRIES; i++)
+		INIT_LIST_HEAD(&cfqd->crq_hash[i]);
+	for (i = 0; i < CFQ_QHASH_ENTRIES; i++)
+		INIT_LIST_HEAD(&cfqd->cfq_hash[i]);
+
+	cfqd->cfq_queued = cfq_queued;
+	cfqd->cfq_quantum = cfq_quantum;
+	cfqd->cfq_idle_quantum = cfq_idle_quantum;
+	cfqd->cfq_grace_rt = cfq_grace_rt;
+	cfqd->cfq_grace_idle = cfq_grace_idle;
+
+	q->nr_requests <<= 2;
+
+	cfqd->dispatch = &q->queue_head;
+	e->elevator_data = cfqd;
+
+	return 0;
+out_crqpool:
+	kfree(cfqd->cfq_hash);
+out_cfqhash:
+	kfree(cfqd->crq_hash);
+out_crqhash:
+	kfree(cfqd);
+	return -ENOMEM;
+}
+
+static int __init cfq_slab_setup(void)
+{
+	crq_pool = kmem_cache_create("crq_pool", sizeof(struct cfq_rq), 0, 0,
+					NULL, NULL);
+
+	if (!crq_pool)
+		panic("cfq_iosched: can't init crq pool\n");
+
+	cfq_pool = kmem_cache_create("cfq_pool", sizeof(struct cfq_queue), 0, 0,
+					NULL, NULL);
+
+	if (!cfq_pool)
+		panic("cfq_iosched: can't init cfq pool\n");
+
+	cfq_mpool = mempool_create(64, mempool_alloc_slab, mempool_free_slab, cfq_pool);
+
+	if (!cfq_mpool)
+		panic("cfq_iosched: can't init cfq mpool\n");
+
+	return 0;
+}
+
+subsys_initcall(cfq_slab_setup);
+
+/*
+ * sysfs parts below -->
+ */
+struct cfq_fs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct cfq_data *, char *);
+	ssize_t (*store)(struct cfq_data *, const char *, size_t);
+};
+
+static ssize_t
+cfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+cfq_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR)					\
+static ssize_t __FUNC(struct cfq_data *cfqd, char *page)		\
+{									\
+	return cfq_var_show(__VAR, (page));				\
+}
+SHOW_FUNCTION(cfq_quantum_show, cfqd->cfq_quantum);
+SHOW_FUNCTION(cfq_idle_quantum_show, cfqd->cfq_idle_quantum);
+SHOW_FUNCTION(cfq_queued_show, cfqd->cfq_queued);
+SHOW_FUNCTION(cfq_grace_rt_show, cfqd->cfq_grace_rt);
+SHOW_FUNCTION(cfq_grace_idle_show, cfqd->cfq_grace_idle);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)				\
+static ssize_t __FUNC(struct cfq_data *cfqd, const char *page, size_t count)	\
+{									\
+	int ret = cfq_var_store(__PTR, (page), count);			\
+	if (*(__PTR) < (MIN))						\
+		*(__PTR) = (MIN);					\
+	else if (*(__PTR) > (MAX))					\
+		*(__PTR) = (MAX);					\
+	return ret;							\
+}
+STORE_FUNCTION(cfq_quantum_store, &cfqd->cfq_quantum, 0, INT_MAX);
+STORE_FUNCTION(cfq_idle_quantum_store, &cfqd->cfq_idle_quantum, 0, INT_MAX);
+STORE_FUNCTION(cfq_queued_store, &cfqd->cfq_queued, 0, INT_MAX);
+STORE_FUNCTION(cfq_grace_rt_store, &cfqd->cfq_grace_rt, 0, INT_MAX);
+STORE_FUNCTION(cfq_grace_idle_store, &cfqd->cfq_grace_idle, 0, INT_MAX);
+#undef STORE_FUNCTION
+
+static struct cfq_fs_entry cfq_quantum_entry = {
+	.attr = {.name = "quantum", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_quantum_show,
+	.store = cfq_quantum_store,
+};
+static struct cfq_fs_entry cfq_idle_quantum_entry = {
+	.attr = {.name = "idle_quantum", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_idle_quantum_show,
+	.store = cfq_idle_quantum_store,
+};
+static struct cfq_fs_entry cfq_queued_entry = {
+	.attr = {.name = "queued", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_queued_show,
+	.store = cfq_queued_store,
+};
+static struct cfq_fs_entry cfq_grace_rt_entry = {
+	.attr = {.name = "grace_rt", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_grace_rt_show,
+	.store = cfq_grace_rt_store,
+};
+static struct cfq_fs_entry cfq_grace_idle_entry = {
+	.attr = {.name = "grace_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_grace_idle_show,
+	.store = cfq_grace_idle_store,
+};
+
+static struct attribute *default_attrs[] = {
+	&cfq_quantum_entry.attr,
+	&cfq_idle_quantum_entry.attr,
+	&cfq_queued_entry.attr,
+	&cfq_grace_rt_entry.attr,
+	&cfq_grace_idle_entry.attr,
+	NULL,
+};
+
+#define to_cfq(atr) container_of((atr), struct cfq_fs_entry, attr)
+
+static ssize_t
+cfq_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
+{
+	elevator_t *e = container_of(kobj, elevator_t, kobj);
+	struct cfq_fs_entry *entry = to_cfq(attr);
+
+	if (!entry->show)
+		return 0;
+
+	return entry->show(e->elevator_data, page);
+}
+
+static ssize_t
+cfq_attr_store(struct kobject *kobj, struct attribute *attr,
+	       const char *page, size_t length)
+{
+	elevator_t *e = container_of(kobj, elevator_t, kobj);
+	struct cfq_fs_entry *entry = to_cfq(attr);
+
+	if (!entry->store)
+		return -EINVAL;
+
+	return entry->store(e->elevator_data, page, length);
+}
+
+static struct sysfs_ops cfq_sysfs_ops = {
+	.show	= cfq_attr_show,
+	.store	= cfq_attr_store,
+};
+
+struct kobj_type cfq_ktype = {
+	.sysfs_ops	= &cfq_sysfs_ops,
+	.default_attrs	= default_attrs,
+};
+
+elevator_t iosched_cfq = {
+	.elevator_name =		"cfq",
+	.elevator_ktype =		&cfq_ktype,
+	.elevator_merge_fn = 		cfq_merge,
+	.elevator_merged_fn =		cfq_merged_request,
+	.elevator_merge_req_fn =	cfq_merged_requests,
+	.elevator_next_req_fn =		cfq_next_request,
+	.elevator_add_req_fn =		cfq_insert_request,
+	.elevator_remove_req_fn =	cfq_remove_request,
+	.elevator_queue_empty_fn =	cfq_queue_empty,
+	.elevator_former_req_fn =	cfq_former_request,
+	.elevator_latter_req_fn =	cfq_latter_request,
+	.elevator_set_req_fn =		cfq_set_request,
+	.elevator_put_req_fn =		cfq_put_request,
+	.elevator_may_queue_fn =	cfq_may_queue,
+	.elevator_set_congested_fn =	cfq_queue_congested,
+	.elevator_init_fn =		cfq_init,
+	.elevator_exit_fn =		cfq_exit,
+};
+
+EXPORT_SYMBOL(iosched_cfq);

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] cfq + io priorities
  2003-11-08 12:47 Jens Axboe
@ 2003-11-08 13:25 ` Jens Axboe
  2003-11-08 14:06   ` Jens Axboe
  2003-11-09 21:30 ` Shailabh Nagar
  1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2003-11-08 13:25 UTC (permalink / raw)
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 248 bytes --]

On Sat, Nov 08 2003, Jens Axboe wrote:
> I'm attaching the simple ionice tool. It's used as follows:

Here's one that works, sorry about that. To compile:

# gcc -Wall -D__X86 -o ionice ionice.c

or other define for PPC or X86_64.
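
The attached ionice.c also checks __PPC and __X86_64, so the corresponding
builds should presumably be:

# gcc -Wall -D__PPC -o ionice ionice.c
# gcc -Wall -D__X86_64 -o ionice ionice.c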

-- 
Jens Axboe


[-- Attachment #2: ionice.c --]
[-- Type: text/plain, Size: 987 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>

extern int sys_ioprio_set(int);
extern int sys_ioprio_get(void);

#ifdef __X86
#define __NR_ioprio_set		274
#define __NR_ioprio_get		275
#endif

#ifdef __X86_64
#define __NR_ioprio_set		237
#define __NR_ioprio_get		238
#endif

#ifdef __PPC
#define __NR_ioprio_set		255
#define __NR_ioprio_get		256
#endif

#ifndef __NR_ioprio_set
#error set arch
#endif

_syscall1(int, ioprio_set, int, ioprio);
_syscall0(int, ioprio_get);

int main(int argc, char *argv[])
{
	int ioprio = 2, set = 0;
	int c;

	while ((c = getopt(argc, argv, "+n:")) != EOF) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		}
	}

	if (!set)
		printf("%d\n", ioprio_get());
	else if (argv[optind]) {
		if (ioprio_set(ioprio) == -1) {
			perror("ioprio_set");
			return 1;
		}

		execvp(argv[optind], &argv[optind]);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] cfq + io priorities
@ 2003-11-08 12:47 Jens Axboe
  2003-11-08 13:25 ` Jens Axboe
  2003-11-09 21:30 ` Shailabh Nagar
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2003-11-08 12:47 UTC (permalink / raw)
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 40102 bytes --]

Hi,

I've implemented IO nice levels in the CFQ io scheduler. It works as
follows.

A process has an assigned io nice level, anywhere from 0 to 20. Both of
these end values are "special" - 0 means the process is only allowed to
do io if the disk is idle, and 20 means the process io is considered
realtime. Realtime IO always gets first access to the disk. Values from
1 to 19 assign 5-95% of disk bandwidth to that process. Any io class is
allowed to use all of disk bandwidth in absence of higher priority io.
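
As a rough sketch (the helper below is made up, not part of the patch), the
way a level turns into a per-round dispatch allowance is essentially the
calculation cfq_select_requests() does further down, ignoring the averaging
against the previous round:

#define IOPRIO_NR	21	/* io levels 0..20, as in the patch */

/*
 * sketch of the per-class allowance from cfq_select_requests():
 * prio is a level between 1 and 19, other_io is the number of queued
 * requests belonging to other classes, cfq_quantum defaults to 4
 */
static int class_allowance(int prio, int other_io, int cfq_quantum)
{
	int may_queue = cfq_quantum;

	if (other_io)
		may_queue = other_io * (prio + 1) / IOPRIO_NR;
	if (may_queue > cfq_quantum)
		may_queue = cfq_quantum;

	return may_queue;
}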

Idle and realtime IO settings work as expected, but not much tuning has
gone into making sure the individual levels in between behave exactly as
advertised. It should be good enough for some testing at least, even if it
has some holes.

About the patch: stuff like this really needs some resource management
abstraction like CKRM. Right now we just look at the tgid of the
process. I've added two syscalls for setting and getting io priority.
Don't consider this final or anything, it's just easy for testing. Patch
has been tested on x86 and ppc, syscalls are also added for x86_64.
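
For a program that wants to set its own priority without the _syscall
macros, a minimal wrapper (sketch only, not from the patch; the numbers are
the i386 ones assigned by this patch, see the unistd.h hunk) could look like:

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_ioprio_set
#define __NR_ioprio_set	274	/* i386 numbers from this patch */
#define __NR_ioprio_get	275
#endif

/* set the io priority of the calling process (0=idle, 10=default, 20=rt) */
static int ioprio_set(int ioprio)
{
	return syscall(__NR_ioprio_set, ioprio);
}

/* read back the current io priority */
static int ioprio_get(void)
{
	return syscall(__NR_ioprio_get);
}

(The ionice tool attached below does the same thing with the older
_syscall1()/_syscall0() macros.)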

I'm attaching the simple ionice tool. It's used as follows:

# ionice -n20 bash

starts a bash shell with realtime io. Beware that io level is inherited
on fork, so any program you start from this shell will also run with
realtime io.

# ionice -n0 dbench 32

run some dbench thrasher, but only when disk is idle.

Pretty straightforward :-)
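
Run with no command and no -n option, ionice should simply print the
caller's current io priority (10 for a process that never changed it):

# ionice
10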

For really good results, you probably also want to set cpu nice level.
Needless to say, a realtime io process can only submit io when it gets
scheduled.
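
One possible combination (not from the original mail) is boosting a shell
on both axes, or conversely demoting a background run on both:

# nice -n -10 ionice -n20 bash
# nice -n 19 ionice -n0 dbench 32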

Default IO priority for a new process is 10.

Patch is against bk-current.

===== arch/i386/kernel/entry.S 1.69 vs edited =====
--- 1.69/arch/i386/kernel/entry.S	Wed Oct  1 15:53:17 2003
+++ edited/arch/i386/kernel/entry.S	Sat Nov  8 11:53:19 2003
@@ -880,5 +880,7 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+	.long sys_ioprio_set
+	.long sys_ioprio_get
 
 nr_syscalls=(.-sys_call_table)/4
===== arch/ppc/kernel/misc.S 1.49 vs edited =====
--- 1.49/arch/ppc/kernel/misc.S	Fri Sep 12 18:26:52 2003
+++ edited/arch/ppc/kernel/misc.S	Sat Nov  8 11:53:19 2003
@@ -1385,3 +1385,5 @@
 	.long sys_statfs64
 	.long sys_fstatfs64
 	.long ppc_fadvise64_64
+	.long sys_ioprio_set
+	.long sys_ioprio_get
===== drivers/block/Kconfig.iosched 1.2 vs edited =====
--- 1.2/drivers/block/Kconfig.iosched	Fri Aug 15 03:16:57 2003
+++ edited/drivers/block/Kconfig.iosched	Sat Nov  8 11:55:04 2003
@@ -27,3 +27,11 @@
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_CFQ
+	bool "CFQ I/O scheduler" if EMBEDDED
+	default y
+	---help---
+	  The CFQ I/O scheduler tries to distribute bandwidth equally
+	  among all processes in the system, with the option of assigning
+	  io priorities to a process or process group. It should provide a fair
+	  working environment, suitable for desktop systems.
===== drivers/block/Makefile 1.22 vs edited =====
--- 1.22/drivers/block/Makefile	Thu Oct 16 06:38:46 2003
+++ edited/drivers/block/Makefile	Sat Nov  8 11:53:19 2003
@@ -18,6 +18,7 @@
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_AS)	+= as-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
+obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MAC_FLOPPY)	+= swim3.o
 obj-$(CONFIG_BLK_DEV_FD)	+= floppy.o
 obj-$(CONFIG_BLK_DEV_FD98)	+= floppy98.o
===== drivers/block/elevator.c 1.52 vs edited =====
--- 1.52/drivers/block/elevator.c	Fri Sep  5 12:13:05 2003
+++ edited/drivers/block/elevator.c	Sat Nov  8 11:53:19 2003
@@ -302,6 +302,14 @@
 		e->elevator_put_req_fn(q, rq);
 }
 
+void elv_set_congested(request_queue_t *q)
+{
+	elevator_t *e = &q->elevator;
+
+	if (e->elevator_set_congested_fn)
+		e->elevator_set_congested_fn(q);
+}
+
 int elv_may_queue(request_queue_t *q, int rw)
 {
 	elevator_t *e = &q->elevator;
@@ -309,7 +317,7 @@
 	if (e->elevator_may_queue_fn)
 		return e->elevator_may_queue_fn(q, rw);
 
-	return 0;
+	return 1;
 }
 
 void elv_completed_request(request_queue_t *q, struct request *rq)
===== drivers/block/ll_rw_blk.c 1.221 vs edited =====
--- 1.221/drivers/block/ll_rw_blk.c	Thu Nov  6 00:11:47 2003
+++ edited/drivers/block/ll_rw_blk.c	Sat Nov  8 11:53:19 2003
@@ -1331,7 +1331,9 @@
 static int __make_request(request_queue_t *, struct bio *);
 
 static elevator_t *chosen_elevator =
-#if defined(CONFIG_IOSCHED_AS)
+#if defined(CONFIG_IOSCHED_CFQ)
+	&iosched_cfq;
+#elif defined(CONFIG_IOSCHED_AS)
 	&iosched_as;
 #elif defined(CONFIG_IOSCHED_DEADLINE)
 	&iosched_deadline;
@@ -1353,6 +1355,10 @@
 	if (!strcmp(str, "as"))
 		chosen_elevator = &iosched_as;
 #endif
+#ifdef CONFIG_IOSCHED_CFQ
+	if (!strcmp(str, "cfq"))
+		chosen_elevator = &iosched_cfq;
+#endif
 #ifdef CONFIG_IOSCHED_NOOP
 	if (!strcmp(str, "noop"))
 		chosen_elevator = &elevator_noop;
@@ -1553,6 +1559,10 @@
 	struct io_context *ioc = get_io_context(gfp_mask);
 
 	spin_lock_irq(q->queue_lock);
+
+	if (!elv_may_queue(q, rw))
+		goto out_lock;
+
 	if (rl->count[rw]+1 >= q->nr_requests) {
 		/*
 		 * The queue will fill after this allocation, so set it as
@@ -1566,15 +1576,12 @@
 		}
 	}
 
-	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
-		/*
-		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
-		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
-	}
+	/*
+	 * The queue is full and the allocating process is not a
+	 * "batcher", and not exempted by the IO scheduler
+	 */
+	if (blk_queue_full(q, rw) && !ioc_batching(ioc))
+		goto out_lock;
 
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
@@ -1592,8 +1599,7 @@
 		 */
 		spin_lock_irq(q->queue_lock);
 		freed_request(q, rw);
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_lock;
 	}
 
 	if (ioc_batching(ioc))
@@ -1621,7 +1627,12 @@
 
 out:
 	put_io_context(ioc);
+	if (!rq)
+		elv_set_congested(q);
 	return rq;
+out_lock:
+	spin_unlock_irq(q->queue_lock);
+	goto out;
 }
 
 /*
@@ -2936,3 +2947,21 @@
 		kobject_put(&disk->kobj);
 	}
 }
+
+asmlinkage int sys_ioprio_set(int ioprio)
+{
+	if (ioprio < IOPRIO_IDLE || ioprio > IOPRIO_RT)
+		return -EINVAL;
+	if (ioprio == IOPRIO_RT && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	printk("%s: set ioprio %d\n", current->comm, ioprio);
+	current->ioprio = ioprio;
+	return 0;
+}
+
+asmlinkage int sys_ioprio_get(void)
+{
+	return current->ioprio;
+}
+
===== include/asm-i386/unistd.h 1.30 vs edited =====
--- 1.30/include/asm-i386/unistd.h	Thu Oct  2 09:12:21 2003
+++ edited/include/asm-i386/unistd.h	Sat Nov  8 11:53:20 2003
@@ -279,8 +279,10 @@
 #define __NR_utimes		271
 #define __NR_fadvise64_64	272
 #define __NR_vserver		273
+#define __NR_ioprio_set		274
+#define __NR_ioprio_get		275
 
-#define NR_syscalls 274
+#define NR_syscalls 276
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
===== include/asm-ppc/unistd.h 1.26 vs edited =====
--- 1.26/include/asm-ppc/unistd.h	Sat Aug 23 04:15:18 2003
+++ edited/include/asm-ppc/unistd.h	Sat Nov  8 11:53:20 2003
@@ -259,8 +259,10 @@
 #define __NR_statfs64		252
 #define __NR_fstatfs64		253
 #define __NR_fadvise64_64	254
+#define __NR_ioprio_set		255
+#define __NR_ioprio_get		256
 
-#define __NR_syscalls		255
+#define __NR_syscalls		257
 
 #define __NR(n)	#n
 
===== include/asm-x86_64/unistd.h 1.18 vs edited =====
--- 1.18/include/asm-x86_64/unistd.h	Sun Oct  5 18:35:37 2003
+++ edited/include/asm-x86_64/unistd.h	Sat Nov  8 11:53:20 2003
@@ -532,8 +532,12 @@
 __SYSCALL(__NR_utimes, sys_utimes)
 #define __NR_vserver		236
 __SYSCALL(__NR_vserver, sys_ni_syscall)
+#define __NR_ioprio_set		237
+__SYSCALL(__NR_ioprio_set, sys_ioprio_set)
+#define __NR_ioprio_get		238
+__SYSCALL(__NR_ioprio_get, sys_ioprio_get)
 
-#define __NR_syscall_max __NR_vserver
+#define __NR_syscall_max __NR_ioprio_get
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */
===== include/linux/elevator.h 1.29 vs edited =====
--- 1.29/include/linux/elevator.h	Sun Sep 21 23:50:12 2003
+++ edited/include/linux/elevator.h	Sat Nov  8 11:53:20 2003
@@ -17,6 +17,7 @@
 typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *);
 typedef void (elevator_completed_req_fn) (request_queue_t *, struct request *);
 typedef int (elevator_may_queue_fn) (request_queue_t *, int);
+typedef void (elevator_set_congested_fn) (request_queue_t *);
 
 typedef int (elevator_set_req_fn) (request_queue_t *, struct request *, int);
 typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
@@ -45,6 +46,7 @@
 	elevator_put_req_fn *elevator_put_req_fn;
 
 	elevator_may_queue_fn *elevator_may_queue_fn;
+	elevator_set_congested_fn *elevator_set_congested_fn;
 
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
@@ -74,6 +76,7 @@
 extern int elv_register_queue(request_queue_t *q);
 extern void elv_unregister_queue(request_queue_t *q);
 extern int elv_may_queue(request_queue_t *, int);
+extern void elv_set_congested(request_queue_t *);
 extern void elv_completed_request(request_queue_t *, struct request *);
 extern int elv_set_request(request_queue_t *, struct request *, int);
 extern void elv_put_request(request_queue_t *, struct request *);
@@ -93,6 +96,11 @@
  * anticipatory I/O scheduler
  */
 extern elevator_t iosched_as;
+
+/*
+ * completely fair queueing I/O scheduler
+ */
+extern elevator_t iosched_cfq;
 
 extern int elevator_init(request_queue_t *, elevator_t *);
 extern void elevator_exit(request_queue_t *);
===== include/linux/fs.h 1.274 vs edited =====
--- 1.274/include/linux/fs.h	Tue Sep 23 06:16:30 2003
+++ edited/include/linux/fs.h	Sat Nov  8 11:53:20 2003
@@ -1408,5 +1408,16 @@
 	return res;
 }
 
+/* io priorities */
+
+#define IOPRIO_NR      21
+
+#define IOPRIO_IDLE	0
+#define IOPRIO_NORM	10
+#define IOPRIO_RT	20
+
+asmlinkage int sys_ioprio_set(int ioprio);
+asmlinkage int sys_ioprio_get(void);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_FS_H */
===== include/linux/init_task.h 1.27 vs edited =====
--- 1.27/include/linux/init_task.h	Tue Aug 19 04:46:23 2003
+++ edited/include/linux/init_task.h	Sat Nov  8 11:53:20 2003
@@ -108,6 +108,7 @@
 	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.ioprio		= IOPRIO_NORM,					\
 }
 
 
===== include/linux/sched.h 1.174 vs edited =====
--- 1.174/include/linux/sched.h	Mon Oct 27 04:16:19 2003
+++ edited/include/linux/sched.h	Sat Nov  8 11:53:20 2003
@@ -462,6 +462,8 @@
 
 	struct io_context *io_context;
 
+	int ioprio;
+
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
 };
===== kernel/fork.c 1.145 vs edited =====
--- 1.145/kernel/fork.c	Fri Oct 10 00:13:54 2003
+++ edited/kernel/fork.c	Sat Nov  8 11:53:20 2003
@@ -1046,6 +1046,7 @@
 	} else
 		link_pid(p, p->pids + PIDTYPE_TGID, &p->group_leader->pids[PIDTYPE_TGID].pid);
 
+	p->ioprio = current->ioprio;
 	nr_threads++;
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
--- /dev/null	2003-09-24 00:19:32.000000000 +0200
+++ linux-2.6-cfq/drivers/block/cfq-iosched.c	2003-11-08 12:23:43.000000000 +0100
@@ -0,0 +1,1169 @@
+/*
+ *  linux/drivers/block/cfq-iosched.c
+ *
+ *  CFQ, or complete fairness queueing, disk scheduler.
+ *
+ *  Based on ideas from a previously unfinished io
+ *  scheduler (round robin per-process disk scheduling) and Andrea Arcangeli.
+ *
+ *  IO priorities are supported, from 0% to 100% in 5% increments. Both of
+ *  those values have special meaning - 0% class is allowed to do io if
+ *  no one else wants to use the disk. 100% is considered real-time io, and
+ *  always gets priority. Default process io rate is 95%. In absence of other
+ *  io, a class may consume 100% disk bandwidth regardless. Within a class,
+ *  bandwidth is distributed equally among the citizens.
+ *
+ * TODO:
+ *	- cfq_select_requests() needs some work for 5-95% io
+ *	- barriers not supported
+ *
+ *  Copyright (C) 2003 Jens Axboe <axboe@suse.de>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/hash.h>
+#include <linux/rbtree.h>
+#include <linux/mempool.h>
+
+#if IOPRIO_NR > BITS_PER_LONG
+#error Cannot support this many io priority levels
+#endif
+
+/*
+ * tunables
+ */
+static int cfq_quantum = 4;
+static int cfq_idle_quantum = 2;
+static int cfq_queued = 4;
+static int cfq_grace_rt = HZ / 150 ?: 1;
+static int cfq_grace_idle = HZ / 2;
+
+#define CFQ_QHASH_SHIFT		6
+#define CFQ_QHASH_ENTRIES	(1 << CFQ_QHASH_SHIFT)
+#define list_entry_qhash(entry)	list_entry((entry), struct cfq_queue, cfq_hash)
+
+#define CFQ_MHASH_SHIFT		8
+#define CFQ_MHASH_BLOCK(sec)	((sec) >> 3)
+#define CFQ_MHASH_ENTRIES	(1 << CFQ_MHASH_SHIFT)
+#define CFQ_MHASH_FN(sec)	(hash_long(CFQ_MHASH_BLOCK((sec)),CFQ_MHASH_SHIFT))
+#define ON_MHASH(crq)		!list_empty(&(crq)->hash)
+#define rq_hash_key(rq)		((rq)->sector + (rq)->nr_sectors)
+#define list_entry_hash(ptr)	list_entry((ptr), struct cfq_rq, hash)
+
+#define list_entry_cfqq(ptr)	list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_prio(ptr)	list_entry((ptr), struct cfq_rq, prio_list)
+
+#define RQ_DATA(rq)		((struct cfq_rq *) (rq)->elevator_private)
+
+#define cfq_account_io(crq)	\
+	((crq)->ioprio != IOPRIO_IDLE && (crq)->ioprio != IOPRIO_RT)
+
+/*
+ * defines how we distribute bandwidth (can be tgid, uid, etc)
+ */
+#define cfq_hash_key(current)	((current)->tgid)
+
+/*
+ * move to io_context
+ */
+#define cfq_ioprio(current)	((current)->ioprio)
+
+#define CFQ_WAIT_RT	0
+#define CFQ_WAIT_NORM	1
+
+static kmem_cache_t *crq_pool;
+static kmem_cache_t *cfq_pool;
+static mempool_t *cfq_mpool;
+
+/*
+ * defines an io class
+ */
+struct class_io_data {
+	struct list_head rr_list;
+	int busy_queues;
+	int busy_rq;
+	struct list_head prio_list;
+	int last_queued;
+};
+
+/*
+ * per request_queue structure
+ */
+struct cfq_data {
+	struct list_head *dispatch;
+	struct list_head *cfq_hash;
+	struct list_head *crq_hash;
+	mempool_t *crq_pool;
+
+	struct class_io_data cid[IOPRIO_NR];
+
+	/*
+	 * total number of busy queues and requests
+	 */
+	int busy_rq;
+	int busy_queues;
+
+	unsigned long rq_starved_mask;
+
+	/*
+	 * grace period handling
+	 */
+	struct timer_list timer;
+	unsigned long wait_end;
+	unsigned long flags;
+	struct work_struct work;
+
+	/*
+	 * tunables
+	 */
+	unsigned int cfq_quantum;
+	unsigned int cfq_idle_quantum;
+	unsigned int cfq_queued;
+	unsigned int cfq_grace_rt;
+	unsigned int cfq_grace_idle;
+};
+
+/*
+ * per-process (hash key) queue structure
+ */
+struct cfq_queue {
+	struct list_head cfq_list;
+	struct list_head cfq_hash;
+	int hash_key;
+	struct rb_root sort_list;
+	int queued[2];
+	int ioprio;
+};
+
+/*
+ * per-request structure
+ */
+struct cfq_rq {
+	struct cfq_queue *cfq_queue;
+	struct rb_node rb_node;
+	struct list_head hash;
+	sector_t rb_key;
+
+	struct request *request;
+
+	struct list_head prio_list;
+	int ioprio;
+};
+
+static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq);
+static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int pid);
+static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq);
+
+/*
+ * lots of deadline iosched dupes, can be abstracted later...
+ */
+static inline void __cfq_del_crq_hash(struct cfq_rq *crq)
+{
+	list_del_init(&crq->hash);
+}
+
+static inline void cfq_del_crq_hash(struct cfq_rq *crq)
+{
+	if (ON_MHASH(crq))
+		__cfq_del_crq_hash(crq);
+}
+
+static inline void
+cfq_remove_merge_hints(request_queue_t *q, struct cfq_rq *crq)
+{
+	cfq_del_crq_hash(crq);
+
+	if (q->last_merge == crq->request)
+		q->last_merge = NULL;
+}
+
+static inline void cfq_add_crq_hash(struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	struct request *rq = crq->request;
+
+	BUG_ON(ON_MHASH(crq));
+
+	list_add(&crq->hash, &cfqd->crq_hash[CFQ_MHASH_FN(rq_hash_key(rq))]);
+}
+
+static struct request *cfq_find_rq_hash(struct cfq_data *cfqd, sector_t offset)
+{
+	struct list_head *hash_list = &cfqd->crq_hash[CFQ_MHASH_FN(offset)];
+	struct list_head *entry, *next = hash_list->next;
+
+	while ((entry = next) != hash_list) {
+		struct cfq_rq *crq = list_entry_hash(entry);
+		struct request *__rq = crq->request;
+
+		next = entry->next;
+
+		BUG_ON(!ON_MHASH(crq));
+
+		if (!rq_mergeable(__rq)) {
+			__cfq_del_crq_hash(crq);
+			continue;
+		}
+
+		if (rq_hash_key(__rq) == offset)
+			return __rq;
+	}
+
+	return NULL;
+}
+
+/*
+ * rb tree support functions
+ */
+#define RB_NONE			(2)
+#define RB_EMPTY(node)		((node)->rb_node == NULL)
+#define RB_CLEAR(node)		((node)->rb_color = RB_NONE)
+#define RB_CLEAR_ROOT(root)	((root)->rb_node = NULL)
+#define ON_RB(node)		((node)->rb_color != RB_NONE)
+#define rb_entry_crq(node)	rb_entry((node), struct cfq_rq, rb_node)
+#define rq_rb_key(rq)		(rq)->sector
+
+static void
+cfq_del_crq_rb(struct cfq_data *cfqd, struct cfq_queue *cfqq,struct cfq_rq *crq)
+{
+	if (ON_RB(&crq->rb_node)) {
+		if (cfq_account_io(crq)) {
+			cfqd->busy_rq--;
+			cfqd->cid[crq->ioprio].busy_rq--;
+		}
+		cfqq->queued[rq_data_dir(crq->request)]--;
+		rb_erase(&crq->rb_node, &cfqq->sort_list);
+		RB_CLEAR(&crq->rb_node);
+		crq->cfq_queue = NULL;
+	}
+}
+
+static struct cfq_rq *
+__cfq_add_crq_rb(struct cfq_queue *cfqq, struct cfq_rq *crq)
+{
+	struct rb_node **p = &cfqq->sort_list.rb_node;
+	struct rb_node *parent = NULL;
+	struct cfq_rq *__crq;
+
+	while (*p) {
+		parent = *p;
+		__crq = rb_entry_crq(parent);
+
+		if (crq->rb_key < __crq->rb_key)
+			p = &(*p)->rb_left;
+		else if (crq->rb_key > __crq->rb_key)
+			p = &(*p)->rb_right;
+		else
+			return __crq;
+	}
+
+	rb_link_node(&crq->rb_node, parent, p);
+	return 0;
+}
+
+static void
+cfq_add_crq_rb(struct cfq_data *cfqd, struct cfq_queue *cfqq,struct cfq_rq *crq)
+{
+	struct request *rq = crq->request;
+	struct cfq_rq *__alias;
+
+	crq->rb_key = rq_rb_key(rq);
+	cfqq->queued[rq_data_dir(rq)]++;
+	if (cfq_account_io(crq)) {
+		cfqd->busy_rq++;
+		cfqd->cid[crq->ioprio].busy_rq++;
+	}
+retry:
+	__alias = __cfq_add_crq_rb(cfqq, crq);
+	if (!__alias) {
+		rb_insert_color(&crq->rb_node, &cfqq->sort_list);
+		crq->cfq_queue = cfqq;
+		return;
+	}
+
+	cfq_del_crq_rb(cfqd, cfqq, __alias);
+	cfq_dispatch_sort(cfqd->dispatch, __alias);
+	goto retry;
+}
+
+static struct request *
+cfq_find_rq_rb(struct cfq_data *cfqd, sector_t sector)
+{
+	struct cfq_queue *cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	struct rb_node *n;
+
+	if (!cfqq)
+		goto out;
+
+	n = cfqq->sort_list.rb_node;
+	while (n) {
+		struct cfq_rq *crq = rb_entry_crq(n);
+
+		if (sector < crq->rb_key)
+			n = n->rb_left;
+		else if (sector > crq->rb_key)
+			n = n->rb_right;
+		else
+			return crq->request;
+	}
+
+out:
+	return NULL;
+}
+
+static void cfq_remove_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	if (crq) {
+		cfq_remove_merge_hints(q, crq);
+		list_del_init(&crq->prio_list);
+		list_del_init(&rq->queuelist);
+
+		/*
+		 * set a grace period timer to allow realtime io to make real
+		 * progress, if we release an rt request. for normal request,
+		 * set timer so idle io doesn't interfere with other io
+		 */
+		if (crq->ioprio == IOPRIO_RT) {
+			set_bit(CFQ_WAIT_RT, &cfqd->flags);
+			cfqd->wait_end = jiffies + cfqd->cfq_grace_rt;
+		} else if (crq->ioprio != IOPRIO_IDLE) {
+			set_bit(CFQ_WAIT_NORM, &cfqd->flags);
+			cfqd->wait_end = jiffies + cfqd->cfq_grace_idle;
+		}
+
+		if (ON_RB(&crq->rb_node)) {
+			struct cfq_queue *cfqq = crq->cfq_queue;
+
+			cfq_del_crq_rb(cfqd, cfqq, crq);
+
+			if (RB_EMPTY(&cfqq->sort_list))
+				cfq_put_queue(cfqd, cfqq);
+		}
+	}
+}
+
+static int
+cfq_merge(request_queue_t *q, struct request **req, struct bio *bio)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct request *__rq;
+	int ret;
+
+	ret = elv_try_last_merge(q, bio);
+	if (ret != ELEVATOR_NO_MERGE) {
+		__rq = q->last_merge;
+		goto out_insert;
+	}
+
+	__rq = cfq_find_rq_hash(cfqd, bio->bi_sector);
+	if (__rq) {
+		BUG_ON(__rq->sector + __rq->nr_sectors != bio->bi_sector);
+
+		if (elv_rq_merge_ok(__rq, bio)) {
+			ret = ELEVATOR_BACK_MERGE;
+			goto out;
+		}
+	}
+
+	__rq = cfq_find_rq_rb(cfqd, bio->bi_sector + bio_sectors(bio));
+	if (__rq) {
+		if (elv_rq_merge_ok(__rq, bio)) {
+			ret = ELEVATOR_FRONT_MERGE;
+			goto out;
+		}
+	}
+
+	return ELEVATOR_NO_MERGE;
+out:
+	q->last_merge = __rq;
+out_insert:
+	*req = __rq;
+	return ret;
+}
+
+static void cfq_merged_request(request_queue_t *q, struct request *req)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(req);
+
+	cfq_del_crq_hash(crq);
+	cfq_add_crq_hash(cfqd, crq);
+
+	if (ON_RB(&crq->rb_node) && (rq_rb_key(req) != crq->rb_key)) {
+		struct cfq_queue *cfqq = crq->cfq_queue;
+
+		cfq_del_crq_rb(cfqd, cfqq, crq);
+		cfq_add_crq_rb(cfqd, cfqq, crq);
+	}
+
+	q->last_merge = req;
+}
+
+static void
+cfq_merged_requests(request_queue_t *q, struct request *req,
+		    struct request *next)
+{
+	cfq_merged_request(q, req);
+	cfq_remove_request(q, next);
+}
+
+/*
+ * sort into dispatch list, in optimal ascending order
+ */
+static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq)
+{
+	struct list_head *entry = head;
+	struct request *__rq;
+
+	if (!list_empty(head)) {
+		__rq = list_entry_rq(head->next);
+
+		if (crq->request->sector < __rq->sector) {
+			entry = head->prev;
+			goto link;
+		}
+	}
+
+	while ((entry = entry->prev) != head) {
+		__rq = list_entry_rq(entry);
+
+		if (crq->request->sector <= __rq->sector)
+			break;
+	}
+
+link:
+	list_add_tail(&crq->request->queuelist, entry);
+}
+
+/*
+ * remove from io scheduler core and put on dispatch list for service
+ */
+static inline void
+__cfq_dispatch_requests(request_queue_t *q, struct cfq_data *cfqd,
+			struct cfq_queue *cfqq)
+{
+	struct cfq_rq *crq;
+
+	crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+
+	cfq_del_crq_rb(cfqd, cfqq, crq);
+	cfq_remove_merge_hints(q, crq);
+	cfq_dispatch_sort(cfqd->dispatch, crq);
+
+	/*
+	 * technically, for IOPRIO_RT we don't need to add it to the list.
+	 */
+	list_add_tail(&crq->prio_list, &cfqd->cid[cfqq->ioprio].prio_list);
+}
+
+static int
+cfq_dispatch_requests(request_queue_t *q, int prio, int may_queue)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct list_head *plist = &cfqd->cid[prio].rr_list;
+	struct list_head *entry, *nxt;
+	int queued = 0;
+
+	/*
+	 * for each queue at this prio level, dispatch a request
+	 */
+	list_for_each_safe(entry, nxt, plist) {
+		struct cfq_queue *cfqq = list_entry_cfqq(entry);
+
+		BUG_ON(RB_EMPTY(&cfqq->sort_list));
+
+		__cfq_dispatch_requests(q, cfqd, cfqq);
+		queued++;
+
+		if (RB_EMPTY(&cfqq->sort_list))
+			cfq_put_queue(cfqd, cfqq);
+
+		/*
+		 * if we hit the queue limit, put the string of serviced
+		 * queues at the back of the pending list
+		 */
+		if (queued == may_queue) {
+			struct list_head *prv = nxt->prev;
+
+			if (prv != plist) {
+				list_del(plist);
+				list_add(plist, prv);
+			}
+			break;
+		}
+	}
+
+	return queued;
+}
+
+/*
+ * try to move some requests to the dispatch list. return 0 on success
+ */
+static int cfq_select_requests(request_queue_t *q, struct cfq_data *cfqd)
+{
+	int queued, ios, i;
+
+	/*
+	 * if there's any realtime io, only schedule that
+	 */
+	if (cfq_dispatch_requests(q, IOPRIO_RT, cfqd->cfq_quantum))
+		return 1;
+
+	/*
+	 * if RT io was last serviced and grace time hasn't expired,
+	 * arm the timer to restart queueing if no other RT io has been
+	 * submitted in the mean time
+	 */
+	if (test_bit(CFQ_WAIT_RT, &cfqd->flags)) {
+		if (time_before(jiffies, cfqd->wait_end)) {
+			mod_timer(&cfqd->timer, cfqd->wait_end);
+			return 0;
+		}
+		clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+	}
+
+	/*
+	 * for each priority class, calculate number of requests we
+	 * are allowed to put into service.
+	 */
+	queued = 0;
+	ios = cfqd->busy_rq;
+	for (i = IOPRIO_RT - 1; i > IOPRIO_IDLE; i--) {
+		const int other_io = ios - cfqd->cid[i].busy_rq;
+		int may_queue = cfqd->cfq_quantum;
+		int dispatched;
+
+		/*
+		 * no need to keep iterating the list, if there are no
+		 * requests pending anymore
+		 */
+		if (!cfqd->busy_rq)
+			break;
+
+		if (other_io)
+			may_queue = other_io * (i + 1) / IOPRIO_NR;
+		if (may_queue > cfqd->cfq_quantum)
+			may_queue = cfqd->cfq_quantum;
+		/*
+		 * average with last dispatched for fairness
+		 */
+		if (cfqd->cid[i].last_queued != -1)
+			may_queue = (cfqd->cid[i].last_queued + may_queue) / 2;
+
+		dispatched = cfq_dispatch_requests(q, i, may_queue);
+
+		cfqd->cid[i].last_queued = dispatched;
+		queued += dispatched;
+	}
+
+	if (queued)
+		return 1;
+
+	/*
+	 * only allow dispatch of idle io, if the queue has been idle from
+	 * servicing RT or normal io for the grace period
+	 */
+	if (test_bit(CFQ_WAIT_NORM, &cfqd->flags)) {
+		if (time_before(jiffies, cfqd->wait_end)) {
+			mod_timer(&cfqd->timer, cfqd->wait_end);
+			return 0;
+		}
+		clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+	}
+
+	/*
+	 * if we found nothing to do, allow idle io to be serviced
+	 */
+	if (cfq_dispatch_requests(q, IOPRIO_IDLE, cfqd->cfq_idle_quantum))
+		return 1;
+
+	return 0;
+}
+
+static struct request *cfq_next_request(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct request *rq;
+
+	if (!list_empty(cfqd->dispatch)) {
+		struct cfq_rq *crq;
+dispatch:
+		/*
+		 * end grace period, we are servicing a request
+		 */
+		del_timer(&cfqd->timer);
+		clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+		clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+
+		BUG_ON(list_empty(cfqd->dispatch));
+		rq = list_entry_rq(cfqd->dispatch->next);
+
+		BUG_ON(q->last_merge == rq);
+		crq = RQ_DATA(rq);
+		if (crq) {
+			BUG_ON(ON_MHASH(crq));
+			list_del_init(&crq->prio_list);
+		}
+
+		return rq;
+	}
+
+	/*
+	 * we moved requests to dispatch list, go back and serve one
+	 */
+	if (cfq_select_requests(q, cfqd))
+		goto dispatch;
+
+	return NULL;
+}
+
+static inline struct cfq_queue *
+__cfq_find_cfq_hash(struct cfq_data *cfqd, int hashkey, const int hashval)
+{
+	struct list_head *hash_list = &cfqd->cfq_hash[hashval];
+	struct list_head *entry;
+
+	list_for_each(entry, hash_list) {
+		struct cfq_queue *__cfqq = list_entry_qhash(entry);
+
+		if (__cfqq->hash_key == hashkey)
+			return __cfqq;
+	}
+
+	return NULL;
+}
+
+static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int hashkey)
+{
+	const int hashval = hash_long(hashkey, CFQ_QHASH_SHIFT);
+
+	return __cfq_find_cfq_hash(cfqd, hashkey, hashval);
+}
+
+static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	cfqd->busy_queues--;
+	WARN_ON(cfqd->busy_queues < 0);
+
+	cfqd->cid[cfqq->ioprio].busy_queues--;
+	WARN_ON(cfqd->cid[cfqq->ioprio].busy_queues < 0);
+
+	list_del(&cfqq->cfq_list);
+	list_del(&cfqq->cfq_hash);
+	mempool_free(cfqq, cfq_mpool);
+}
+
+static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, int hashkey)
+{
+	const int hashval = hash_long(hashkey, CFQ_QHASH_SHIFT);
+	struct cfq_queue *cfqq = __cfq_find_cfq_hash(cfqd, hashkey, hashval);
+
+	if (!cfqq) {
+		cfqq = mempool_alloc(cfq_mpool, GFP_NOIO);
+
+		INIT_LIST_HEAD(&cfqq->cfq_hash);
+		INIT_LIST_HEAD(&cfqq->cfq_list);
+		RB_CLEAR_ROOT(&cfqq->sort_list);
+
+		cfqq->hash_key = cfq_hash_key(current);
+		cfqq->queued[0] = cfqq->queued[1] = 0;
+		cfqq->ioprio = cfq_ioprio(current);
+		list_add(&cfqq->cfq_hash, &cfqd->cfq_hash[hashval]);
+	}
+
+	return cfqq;
+}
+
+static void
+__cfq_enqueue(request_queue_t *q, struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	const int prio = crq->ioprio;
+	struct cfq_queue *cfqq;
+
+	cfqq = cfq_get_queue(cfqd, cfq_hash_key(current));
+
+	/*
+	 * not too good...
+	 */
+	if (prio > cfqq->ioprio) {
+		if (!list_empty(&cfqq->cfq_list)) {
+			cfqd->cid[cfqq->ioprio].busy_queues--;
+			WARN_ON(cfqd->cid[cfqq->ioprio].busy_queues < 0);
+			cfqd->cid[prio].busy_queues++;
+			list_move_tail(&cfqq->cfq_list,&cfqd->cid[prio].rr_list);
+		}
+		cfqq->ioprio = prio;
+	}
+
+	cfq_add_crq_rb(cfqd, cfqq, crq);
+
+	if (list_empty(&cfqq->cfq_list)) {
+		list_add_tail(&cfqq->cfq_list, &cfqd->cid[prio].rr_list);
+		cfqd->cid[prio].busy_queues++;
+		cfqd->busy_queues++;
+	}
+
+	if (rq_mergeable(crq->request)) {
+		cfq_add_crq_hash(cfqd, crq);
+
+		if (!q->last_merge)
+			q->last_merge = crq->request;
+	}
+
+}
+
+static void cfq_reenqueue(request_queue_t *q, struct cfq_data *cfqd, int prio)
+{
+	struct list_head *prio_list = &cfqd->cid[prio].prio_list;
+	struct list_head *entry, *tmp;
+
+	list_for_each_safe(entry, tmp, prio_list) {
+		struct cfq_rq *crq = list_entry_prio(entry);
+
+		list_del_init(entry);
+		list_del_init(&crq->request->queuelist);
+		__cfq_enqueue(q, cfqd, crq);
+	}
+}
+
+static void
+cfq_enqueue(request_queue_t *q, struct cfq_data *cfqd, struct cfq_rq *crq)
+{
+	const int prio = cfq_ioprio(current);
+
+	crq->ioprio = prio;
+	__cfq_enqueue(q, cfqd, crq);
+
+	if (prio == IOPRIO_RT) {
+		int i;
+
+		/*
+		 * realtime io gets priority, move all other io back
+		 */
+		for (i = IOPRIO_IDLE; i < IOPRIO_RT; i++)
+			cfq_reenqueue(q, cfqd, i);
+	} else if (prio != IOPRIO_IDLE) {
+		/*
+		 * check if we need to move idle io back into queue
+		 */
+		cfq_reenqueue(q, cfqd, IOPRIO_IDLE);
+	}
+}
+
+static void
+cfq_insert_request(request_queue_t *q, struct request *rq, int where)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	switch (where) {
+		case ELEVATOR_INSERT_BACK:
+#if 0
+			while (cfq_dispatch_requests(q, cfqd))
+				;
+#endif
+			list_add_tail(&rq->queuelist, cfqd->dispatch);
+			break;
+		case ELEVATOR_INSERT_FRONT:
+			list_add(&rq->queuelist, cfqd->dispatch);
+			break;
+		case ELEVATOR_INSERT_SORT:
+			BUG_ON(!blk_fs_request(rq));
+			cfq_enqueue(q, cfqd, crq);
+			break;
+		default:
+			printk("%s: bad insert point %d\n", __FUNCTION__,where);
+			return;
+	}
+}
+
+static int cfq_queue_empty(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+
+	if (list_empty(cfqd->dispatch) && !cfqd->busy_queues)
+		return 1;
+
+	return 0;
+}
+
+static struct request *
+cfq_former_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_rq *crq = RQ_DATA(rq);
+	struct rb_node *rbprev = rb_prev(&crq->rb_node);
+
+	if (rbprev)
+		return rb_entry_crq(rbprev)->request;
+
+	return NULL;
+}
+
+static struct request *
+cfq_latter_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_rq *crq = RQ_DATA(rq);
+	struct rb_node *rbnext = rb_next(&crq->rb_node);
+
+	if (rbnext)
+		return rb_entry_crq(rbnext)->request;
+
+	return NULL;
+}
+
+static void cfq_queue_congested(request_queue_t *q)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+
+	cfqd->rq_starved_mask |= 1 << cfq_ioprio(current);
+}
+
+static int cfq_may_queue(request_queue_t *q, int rw)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_queue *cfqq;
+	const int prio = cfq_ioprio(current);
+	int limit, ret = 1;
+
+	if (!cfqd->busy_queues)
+		goto out;
+
+	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	if (!cfqq)
+		goto out;
+
+	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(current));
+	if (!cfqq)
+		goto out;
+
+	/*
+	 * if higher or equal prio io is sleeping waiting for a request, don't
+	 * allow this one to allocate one. as long as ll_rw_blk does fifo
+	 * waitqueue wakeups this should work...
+	 */
+	if (cfqd->rq_starved_mask & (1 << prio))
+		goto out;
+
+	if (cfqq->queued[rw] < cfqd->cfq_queued || !cfqd->cid[prio].busy_queues)
+		goto out;
+
+	limit = q->nr_requests * (prio + 1) / IOPRIO_NR;
+	limit /= cfqd->cid[prio].busy_queues;
+	if (cfqq->queued[rw] > limit)
+		ret = 0;
+
+out:
+	return ret;
+}
+
+static void cfq_put_request(request_queue_t *q, struct request *rq)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = RQ_DATA(rq);
+
+	if (crq) {
+		BUG_ON(q->last_merge == rq);
+		BUG_ON(ON_MHASH(crq));
+
+		mempool_free(crq, cfqd->crq_pool);
+		rq->elevator_private = NULL;
+	}
+}
+
+static int cfq_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
+{
+	struct cfq_data *cfqd = q->elevator.elevator_data;
+	struct cfq_rq *crq = mempool_alloc(cfqd->crq_pool, gfp_mask);
+
+	if (crq) {
+		/*
+		 * process now has one request
+		 */
+		cfqd->rq_starved_mask &= ~(1 << cfq_ioprio(current));
+
+		memset(crq, 0, sizeof(*crq));
+		RB_CLEAR(&crq->rb_node);
+		crq->request = rq;
+		INIT_LIST_HEAD(&crq->hash);
+		INIT_LIST_HEAD(&crq->prio_list);
+		rq->elevator_private = crq;
+		return 0;
+	}
+
+	return 1;
+}
+
+static void cfq_exit(request_queue_t *q, elevator_t *e)
+{
+	struct cfq_data *cfqd = e->elevator_data;
+
+	e->elevator_data = NULL;
+	mempool_destroy(cfqd->crq_pool);
+	kfree(cfqd->crq_hash);
+	kfree(cfqd->cfq_hash);
+	kfree(cfqd);
+}
+
+static void cfq_timer(unsigned long data)
+{
+	struct cfq_data *cfqd = (struct cfq_data *) data;
+
+	clear_bit(CFQ_WAIT_RT, &cfqd->flags);
+	clear_bit(CFQ_WAIT_NORM, &cfqd->flags);
+	kblockd_schedule_work(&cfqd->work);
+}
+
+static void cfq_work(void *data)
+{
+	request_queue_t *q = data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (cfq_next_request(q))
+		q->request_fn(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static int cfq_init(request_queue_t *q, elevator_t *e)
+{
+	struct cfq_data *cfqd;
+	int i;
+
+	cfqd = kmalloc(sizeof(*cfqd), GFP_KERNEL);
+	if (!cfqd)
+		return -ENOMEM;
+
+	memset(cfqd, 0, sizeof(*cfqd));
+
+	init_timer(&cfqd->timer);
+	cfqd->timer.function = cfq_timer;
+	cfqd->timer.data = (unsigned long) cfqd;
+
+	INIT_WORK(&cfqd->work, cfq_work, q);
+
+	for (i = 0; i < IOPRIO_NR; i++) {
+		struct class_io_data *cid = &cfqd->cid[i];
+
+		INIT_LIST_HEAD(&cid->rr_list);
+		INIT_LIST_HEAD(&cid->prio_list);
+		cid->last_queued = -1;
+	}
+
+	cfqd->crq_hash = kmalloc(sizeof(struct list_head) * CFQ_MHASH_ENTRIES, GFP_KERNEL);
+	if (!cfqd->crq_hash)
+		goto out_crqhash;
+
+	cfqd->cfq_hash = kmalloc(sizeof(struct list_head) * CFQ_QHASH_ENTRIES, GFP_KERNEL);
+	if (!cfqd->cfq_hash)
+		goto out_cfqhash;
+
+	cfqd->crq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, crq_pool);
+	if (!cfqd->crq_pool)
+		goto out_crqpool;
+
+	for (i = 0; i < CFQ_MHASH_ENTRIES; i++)
+		INIT_LIST_HEAD(&cfqd->crq_hash[i]);
+	for (i = 0; i < CFQ_QHASH_ENTRIES; i++)
+		INIT_LIST_HEAD(&cfqd->cfq_hash[i]);
+
+	cfqd->cfq_queued = cfq_queued;
+	cfqd->cfq_quantum = cfq_quantum;
+	cfqd->cfq_idle_quantum = cfq_idle_quantum;
+	cfqd->cfq_grace_rt = cfq_grace_rt;
+	cfqd->cfq_grace_idle = cfq_grace_idle;
+
+	q->nr_requests <<= 2;
+
+	cfqd->dispatch = &q->queue_head;
+	e->elevator_data = cfqd;
+
+	return 0;
+out_crqpool:
+	kfree(cfqd->cfq_hash);
+out_cfqhash:
+	kfree(cfqd->crq_hash);
+out_crqhash:
+	kfree(cfqd);
+	return -ENOMEM;
+}
+
+static int __init cfq_slab_setup(void)
+{
+	crq_pool = kmem_cache_create("crq_pool", sizeof(struct cfq_rq), 0, 0,
+					NULL, NULL);
+
+	if (!crq_pool)
+		panic("cfq_iosched: can't init crq pool\n");
+
+	cfq_pool = kmem_cache_create("cfq_pool", sizeof(struct cfq_queue), 0, 0,
+					NULL, NULL);
+
+	if (!cfq_pool)
+		panic("cfq_iosched: can't init cfq pool\n");
+
+	cfq_mpool = mempool_create(64, mempool_alloc_slab, mempool_free_slab, cfq_pool);
+
+	if (!cfq_mpool)
+		panic("cfq_iosched: can't init cfq mpool\n");
+
+	return 0;
+}
+
+subsys_initcall(cfq_slab_setup);
+
+/*
+ * sysfs parts below -->
+ */
+struct cfq_fs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct cfq_data *, char *);
+	ssize_t (*store)(struct cfq_data *, const char *, size_t);
+};
+
+static ssize_t
+cfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+cfq_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR)					\
+static ssize_t __FUNC(struct cfq_data *cfqd, char *page)		\
+{									\
+	return cfq_var_show(__VAR, (page));				\
+}
+SHOW_FUNCTION(cfq_quantum_show, cfqd->cfq_quantum);
+SHOW_FUNCTION(cfq_idle_quantum_show, cfqd->cfq_idle_quantum);
+SHOW_FUNCTION(cfq_queued_show, cfqd->cfq_queued);
+SHOW_FUNCTION(cfq_grace_rt_show, cfqd->cfq_grace_rt);
+SHOW_FUNCTION(cfq_grace_idle_show, cfqd->cfq_grace_idle);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)				\
+static ssize_t __FUNC(struct cfq_data *cfqd, const char *page, size_t count)	\
+{									\
+	int ret = cfq_var_store(__PTR, (page), count);			\
+	if (*(__PTR) < (MIN))						\
+		*(__PTR) = (MIN);					\
+	else if (*(__PTR) > (MAX))					\
+		*(__PTR) = (MAX);					\
+	return ret;							\
+}
+STORE_FUNCTION(cfq_quantum_store, &cfqd->cfq_quantum, 0, INT_MAX);
+STORE_FUNCTION(cfq_idle_quantum_store, &cfqd->cfq_idle_quantum, 0, INT_MAX);
+STORE_FUNCTION(cfq_queued_store, &cfqd->cfq_queued, 0, INT_MAX);
+STORE_FUNCTION(cfq_grace_rt_store, &cfqd->cfq_grace_rt, 0, INT_MAX);
+STORE_FUNCTION(cfq_grace_idle_store, &cfqd->cfq_grace_idle, 0, INT_MAX);
+#undef STORE_FUNCTION
+
+static struct cfq_fs_entry cfq_quantum_entry = {
+	.attr = {.name = "quantum", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_quantum_show,
+	.store = cfq_quantum_store,
+};
+static struct cfq_fs_entry cfq_idle_quantum_entry = {
+	.attr = {.name = "idle_quantum", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_idle_quantum_show,
+	.store = cfq_idle_quantum_store,
+};
+static struct cfq_fs_entry cfq_queued_entry = {
+	.attr = {.name = "queued", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_queued_show,
+	.store = cfq_queued_store,
+};
+static struct cfq_fs_entry cfq_grace_rt_entry = {
+	.attr = {.name = "grace_rt", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_grace_rt_show,
+	.store = cfq_grace_rt_store,
+};
+static struct cfq_fs_entry cfq_grace_idle_entry = {
+	.attr = {.name = "grace_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_grace_idle_show,
+	.store = cfq_grace_idle_store,
+};
+
+static struct attribute *default_attrs[] = {
+	&cfq_quantum_entry.attr,
+	&cfq_idle_quantum_entry.attr,
+	&cfq_queued_entry.attr,
+	&cfq_grace_rt_entry.attr,
+	&cfq_grace_idle_entry.attr,
+	NULL,
+};
+
+#define to_cfq(atr) container_of((atr), struct cfq_fs_entry, attr)
+
+static ssize_t
+cfq_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
+{
+	elevator_t *e = container_of(kobj, elevator_t, kobj);
+	struct cfq_fs_entry *entry = to_cfq(attr);
+
+	if (!entry->show)
+		return 0;
+
+	return entry->show(e->elevator_data, page);
+}
+
+static ssize_t
+cfq_attr_store(struct kobject *kobj, struct attribute *attr,
+	       const char *page, size_t length)
+{
+	elevator_t *e = container_of(kobj, elevator_t, kobj);
+	struct cfq_fs_entry *entry = to_cfq(attr);
+
+	if (!entry->store)
+		return -EINVAL;
+
+	return entry->store(e->elevator_data, page, length);
+}
+
+static struct sysfs_ops cfq_sysfs_ops = {
+	.show	= cfq_attr_show,
+	.store	= cfq_attr_store,
+};
+
+struct kobj_type cfq_ktype = {
+	.sysfs_ops	= &cfq_sysfs_ops,
+	.default_attrs	= default_attrs,
+};
+
+elevator_t iosched_cfq = {
+	.elevator_name =		"cfq",
+	.elevator_ktype =		&cfq_ktype,
+	.elevator_merge_fn = 		cfq_merge,
+	.elevator_merged_fn =		cfq_merged_request,
+	.elevator_merge_req_fn =	cfq_merged_requests,
+	.elevator_next_req_fn =		cfq_next_request,
+	.elevator_add_req_fn =		cfq_insert_request,
+	.elevator_remove_req_fn =	cfq_remove_request,
+	.elevator_queue_empty_fn =	cfq_queue_empty,
+	.elevator_former_req_fn =	cfq_former_request,
+	.elevator_latter_req_fn =	cfq_latter_request,
+	.elevator_set_req_fn =		cfq_set_request,
+	.elevator_put_req_fn =		cfq_put_request,
+	.elevator_may_queue_fn =	cfq_may_queue,
+	.elevator_set_congested_fn =	cfq_queue_congested,
+	.elevator_init_fn =		cfq_init,
+	.elevator_exit_fn =		cfq_exit,
+};
+
+EXPORT_SYMBOL(iosched_cfq);

-- 
Jens Axboe


[-- Attachment #2: ionice.c --]
[-- Type: text/plain, Size: 1039 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>

extern int sys_ioprio_set(int);
extern int sys_ioprio_get(void);

#if X86
#define __NR_ioprio_set		274
#define __NR_ioprio_get		275
#elif PPC
#define __NR_ioprio_set		255
#define __NR_ioprio_get		256
#elif X86-64
#define __NR_ioprio_set		237
#define __NR_ioprio_get		238
#else
#error unknown arch
#endif

_syscall1(int, ioprio_set, int, ioprio);
_syscall0(int, ioprio_get);

int main(int argc, char *argv[])
{
	int ioprio = 2, set = 0, set_pid = 0;
	pid_t pid;
	int c;

	while ((c = getopt(argc, argv, "+n:+p:")) != EOF) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtoul(optarg, NULL, 10);
			set_pid = 1;
			break;
		}
	}

	if (!set)
		printf("%d\n", ioprio_get());
	else if (argv[optind]) {
		if (ioprio_set(ioprio) == -1) {
			perror("ioprio_set");
			return 1;
		}

		execvp(argv[optind], &argv[optind]);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2003-11-18 13:37 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-11-10  1:49 [PATCH] cfq + io priorities Albert Cahalan
2003-11-10 10:19 ` Herbert Xu
2003-11-10 10:36   ` Ali Magik Rama
2003-11-10 13:07   ` [PATCH] cfq + io priorities Albert Cahalan
2003-11-10 13:31     ` P
2003-11-10 13:37       ` Jens Axboe
2003-11-10 13:57         ` P
2003-11-10 23:52         ` Albert Cahalan
2003-11-18  9:27           ` Pavel Machek
2003-11-11 17:46     ` Toon van der Pas
  -- strict thread matches above, loose matches on Subject: below --
2003-11-09 10:57 Guillaume Chazarain
2003-11-09 11:39 ` Jens Axboe
2003-11-13 12:54   ` Pavel Machek
2003-11-16 22:56     ` Jakob Oestergaard
2003-11-17  8:14     ` Jens Axboe
2003-11-18 13:26       ` Pavel Machek
2003-11-18 13:32         ` Jens Axboe
2003-11-18 13:38           ` Pavel Machek
2003-11-08 12:47 Jens Axboe
2003-11-08 13:25 ` Jens Axboe
2003-11-08 14:06   ` Jens Axboe
2003-11-09 21:30 ` Shailabh Nagar
2003-11-09 21:34   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).