* [PATCH 01/18] io-controller: Documentation
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-06 3:16 ` Gui Jianfeng
[not found] ` <1241553525-28095-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-05 19:58 ` Vivek Goyal
` (36 subsequent siblings)
37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o Documentation for io-controller.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
Documentation/block/00-INDEX | 2 +
Documentation/block/io-controller.txt | 264 +++++++++++++++++++++++++++++++++
2 files changed, 266 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/io-controller.txt
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
+io-controller.txt
+ - IO controller for providing hierarchical IO scheduling
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..1290ada
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,264 @@
+ IO Controller
+ =============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups, assign weights to those cgroups, and the tasks in a
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and the individual IO schedulers to
+do IO control. Hence this IO controller works only on block devices which
+use one of the standard io schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption behind modifying the IO schedulers is that resource control
+is needed only on the leaf nodes, where the actual contention for resources
+is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of these. Some part of sdb is in lv0 and some part
+is in lv1.
+
+ lv0 lv1
+ / \ / \
+ sda sdb sdc
+
+Also consider following cgroup hierarchy
+
+ root
+ / \
+ A B
+ / \ / \
+ T1 T2 T3 T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on the intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO only on lv0 and T3 and T4 are doing IO
+only on lv1, there will not be any contention for resources between groups
+A and B if the IO is going to sda or sdc. But if the actual IO gets
+translated to disk sdb, then the IO scheduler associated with sdb will
+distribute disk bandwidth to groups A and B in proportion to their weights.
+
+CFQ already has a notion of fairness and provides differential disk access
+based on the priority and class of the task. However, it is flat, and with
+cgroups it needs to be made hierarchical to achieve good hierarchical
+control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write
+queues for deadline and AS). With this patchset, we now maintain one queue
+per cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code, which has been moved to a common layer (the elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+ --------------------------------
+ | Elevator Layer + Fair Queuing |
+ --------------------------------
+ | | | |
+ NOOP DEADLINE AS CFQ
+
+Design
+======
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ, which
+uses the B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- It is not clear that the weighted round robin logic of CFQ can be easily
+  extended to hierarchical mode. In particular, we can not keep dividing
+  the time slice of a parent group among its children; the deeper we go in
+  the hierarchy, the smaller the time slice gets.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of the virtual time and the service provided to each queue/group and
+  select a queue/group for service based on one of the various available
+  algorithms.
+
+  BFQ already had support for hierarchical scheduling, so taking those
+  patches was easier.
+
+- BFQ was designed to provide tight bounds on the delay w.r.t. the service
+  provided to a queue. The delay/jitter with BFQ is O(1).
+
+  Note: BFQ originally used the amount of IO done (number of sectors) as
+        the notion of service provided. IOW, it tried to provide fairness
+        in terms of actual IO done and not in terms of the actual time the
+        disk was given to a queue.
+
+        This patchset modifies BFQ to provide fairness in the time domain,
+        because that is what CFQ does. So the idea was to not deviate too
+        much from CFQ behavior initially.
+
+        Providing fairness in the time domain makes accounting tricky:
+        due to command queueing there might be requests from multiple
+        queues outstanding at the same time, and there is no easy way to
+        find out how much disk time was actually consumed by the requests
+        of a particular queue. More about this in the comments in the
+        source code. (A small sketch of the underlying timestamp
+        arithmetic follows this list.)
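+
+To make the time-domain accounting above concrete, below is a minimal,
+illustrative user-space sketch of the B-WF2Q+ timestamp arithmetic. It
+mirrors bfq_delta()/bfq_calc_finish() introduced later in this series; it
+is not kernel code and the numbers are made up.
+
+  #include <stdio.h>
+
+  #define WFQ_SERVICE_SHIFT 22   /* same shift as in elevator-fq.c */
+
+  /* Map service into the virtual time domain (cf. bfq_delta()). */
+  static unsigned long long vt_delta(unsigned long long service,
+                                     unsigned int weight)
+  {
+          return (service << WFQ_SERVICE_SHIFT) / weight;
+  }
+
+  int main(void)
+  {
+          unsigned long long start = 0, service = 100;
+
+          /* Two queues charged the same service, with weights 2 and 1.
+           * finish = start + service/weight (cf. bfq_calc_finish()). */
+          printf("finish (weight 2) = %llu\n", start + vt_delta(service, 2));
+          printf("finish (weight 1) = %llu\n", start + vt_delta(service, 1));
+
+          /* The heavier queue gets the smaller finish timestamp, so the
+           * scheduler picks it again sooner and it ends up receiving
+           * service in proportion to its weight. */
+          return 0;
+  }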
+
+We have taken the BFQ code as the starting point for providing fairness
+among groups because it already contained lots of the features we required
+to implement hierarchical IO scheduling. With this patch set I am not trying
+to ensure O(1) delay, as my goal is to provide fairness among groups. Most
+likely that will mean that latencies are no worse than what cfq currently
+provides (if not improved). Once fairness is ensured, one can look further
+into ensuring O(1) latencies.
+
+From a data structure point of view, one can think of a tree per device on
+which io groups and io queues hang and are scheduled using the B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored
+and dispatched from (like a cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, while cfq keeps one io queue per
+io_context in a cgroup (apart from the async queues).
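+
+Roughly, the data structures introduced by the following patches nest as
+shown below (one such hierarchy per block device):
+
+       elv_fq_data (per device)
+            |
+        io_group ---> io_sched_data ---> io_service_tree (per ioprio class)
+                                               /        \
+                                         active tree   idle tree
+                                         (rb trees keyed by finish time)
+                                               |
+                                           io_entity (B-WF2Q+ timestamps,
+                                               |       weight, budget)
+                                           io_queue  (holds the requests)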
+
+A request is mapped to an io group by the elevator layer, and which io queue
+it is mapped to within the group depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence the io group) of the
+request. Down the line we need to make use of the bio-cgroup patches to map
+delayed writes to the right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make
+use of that logic, so that the end IO schedulers start supporting
+hierarchical scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical schedulers and the old non-hierarchical schedulers operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical
+manner. If hierarchical scheduling is disabled, noop, deadline and AS
+should retain their existing behavior.
+
+CFQ is the only exception: one can not disable fair queuing there, as it is
+needed for providing fairness among various threads even in
+non-hierarchical mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+ - Enables hierarchical fair queuing in noop. Not selecting this option
+   leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+ - Enables hierarchical fair queuing in deadline. Not selecting this
+   option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+ - Enables hierarchical fair queuing in AS. Not selecting this option
+   leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+ - Enables hierarchical fair queuing in CFQ. Not selecting this option
+   still does fair queuing among various queues, but it is flat and not
+ hierarchical.
+
+CGROUP_BLKIO
+ - This option enables the blkio-cgroup controller for IO tracking
+   purposes. That means that with this controller one can attribute a
+   write to the original cgroup and not assume that it belongs to the
+   submitting thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+ - Currently CFQ attributes writes to the submitting thread and caches
+   the async queue pointer in the io context of the process. If this
+   option is set, it tells cfq and the elevator fair queuing logic to
+   make use of the IO tracking patches for async writes and attribute
+   them to the original cgroup rather than to the submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+ - Throws extra debug messages in the blktrace output, which are helpful
+   when debugging in a hierarchical setup.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+ - Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+ - Enables/Disables hierarchical queuing and associated cgroup bits.
+
+TODO
+====
+- Lots of code cleanups, testing, bug fixing, optimizations,
+ benchmarking etc...
+
+- Debug and fix some of the areas where higher weight cgroup async writes
+ are stuck behind lower weight cgroup async writes.
+
+- The anticipatory scheduler code will need more work. It is not working
+  properly currently and needs more thought.
+
+- Once things start working, planning to look into the core algorithm. It
+  looks complicated and maintains lots of data structures. Need to spend
+  some time to see if it can be simplified.
+
+- Currently a cgroup setting is global, that is, it applies to all the
+  block devices in the system. It will probably make more sense to make it
+  a per-cgroup, per-device setting so that a cgroup can have different
+  weights on different devices, etc.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+ CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+ CONFIG_TRACK_ASYNC_CONTEXT=y
+
+ (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into the kernel, and mount the IO controller and the
+  blkio IO tracking controller.
+
+ mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/io.ioprio
+ echo 500 > /cgroup/test2/io.ioprio
+
+- Create two files of the same size (say 512MB each) on the same disk
+  (file1, file2) and launch two dd threads in different cgroups to read
+  those files. Make sure the right io scheduler is being used for the block
+  device where the files are present (the one you compiled in hierarchical
+  mode).
+
+ echo 1 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/lv0/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/lv0/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At a macro level, the first dd should finish first. To get more precise
+  data, keep looking (with the help of a script) at the io.disk_time and
+  io.disk_sectors files of both the test1 and test2 groups. These tell how
+  much disk time (in milliseconds) each group got and how many sectors each
+  group dispatched to the disk. We provide fairness in terms of disk time,
+  so ideally io.disk_time of the cgroups should be in proportion to the
+  weights. (It is hard to achieve though :-)). A small polling sketch
+  follows below.
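+
+A minimal user-space polling sketch (illustrative only; it assumes the
+cgroup hierarchy is mounted at /cgroup and the test1/test2 groups have been
+created as above):
+
+  /* Print io.disk_time of both test groups once a second so that the
+   * ratio can be compared against the configured weights. */
+  #include <stdio.h>
+  #include <unistd.h>
+
+  static void dump(const char *path)
+  {
+          char line[256];
+          FILE *f = fopen(path, "r");
+
+          if (!f)
+                  return;
+          while (fgets(line, sizeof(line), f))
+                  printf("%s: %s", path, line);
+          fclose(f);
+  }
+
+  int main(void)
+  {
+          for (;;) {
+                  dump("/cgroup/test1/io.disk_time");
+                  dump("/cgroup/test2/io.disk_time");
+                  sleep(1);
+          }
+          return 0;
+  }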
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 01/18] io-controller: Documentation
2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
@ 2009-05-06 3:16 ` Gui Jianfeng
[not found] ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-06 13:31 ` Vivek Goyal
[not found] ` <1241553525-28095-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-06 3:16 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
> + mount -t cgroup -o io,blkio none /cgroup
> +
> +- Create two cgroups
> + mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set weights of group test1 and test2
> + echo 1000 > /cgroup/test1/io.ioprio
> + echo 500 > /cgroup/test2/io.ioprio
It seems this should be /cgroup/test2/io.weight
> +
> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> + launch two dd threads in different cgroup to read those files. Make sure
> + right io scheduler is being used for the block device where files are
> + present (the one you compiled in hierarchical mode).
> +
> + echo 1 > /proc/sys/vm/drop_caches
> +
> + dd if=/mnt/lv0/zerofile1 of=/dev/null &
> + echo $! > /cgroup/test1/tasks
> + cat /cgroup/test1/tasks
> +
> + dd if=/mnt/lv0/zerofile2 of=/dev/null &
> + echo $! > /cgroup/test2/tasks
> + cat /cgroup/test2/tasks
> +
> +- At macro level, first dd should finish first. To get more precise data, keep
> + on looking at (with the help of script), at io.disk_time and io.disk_sectors
> + files of both test1 and test2 groups. This will tell how much disk time
> + (in milli seconds), each group got and how many secotors each group
> + dispatched to the disk. We provide fairness in terms of disk time, so
> + ideally io.disk_time of cgroups should be in proportion to the weight.
> + (It is hard to achieve though :-)).
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 01/18] io-controller: Documentation
2009-05-06 3:16 ` Gui Jianfeng
[not found] ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-06 13:31 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:31 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 06, 2009 at 11:16:04AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > + mount -t cgroup -o io,blkio none /cgroup
> > +
> > +- Create two cgroups
> > + mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set weights of group test1 and test2
> > + echo 1000 > /cgroup/test1/io.ioprio
> > + echo 500 > /cgroup/test2/io.ioprio
>
> Here seems should be /cgroup/test2/io.weight
>
Forgot to update these lines while switching from the notion of ioprio
to weight for the groups. Will do that next time.
Thanks
Vivek
> > +
> > +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> > + launch two dd threads in different cgroup to read those files. Make sure
> > + right io scheduler is being used for the block device where files are
> > + present (the one you compiled in hierarchical mode).
> > +
> > + echo 1 > /proc/sys/vm/drop_caches
> > +
> > + dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > + echo $! > /cgroup/test1/tasks
> > + cat /cgroup/test1/tasks
> > +
> > + dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > + echo $! > /cgroup/test2/tasks
> > + cat /cgroup/test2/tasks
> > +
> > +- At macro level, first dd should finish first. To get more precise data, keep
> > + on looking at (with the help of script), at io.disk_time and io.disk_sectors
> > + files of both test1 and test2 groups. This will tell how much disk time
> > + (in milli seconds), each group got and how many secotors each group
> > + dispatched to the disk. We provide fairness in terms of disk time, so
> > + ideally io.disk_time of cgroups should be in proportion to the weight.
> > + (It is hard to achieve though :-)).
>
> --
> Regards
> Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
` (34 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This is the common fair queuing code in the elevator layer, controlled by
the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces
only flat fair queuing support, where there is only one group, the "root
group", and all tasks belong to it.
These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 13 +
block/Makefile | 1 +
block/blk-sysfs.c | 25 +
block/elevator-fq.c | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 488 +++++++++++
block/elevator.c | 46 +-
include/linux/blkdev.h | 5 +
include/linux/elevator.h | 51 ++
8 files changed, 2694 insertions(+), 11 deletions(-)
create mode 100644 block/elevator-fq.c
create mode 100644 block/elevator-fq.h
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
menu "IO Schedulers"
+config ELV_FAIR_QUEUING
+ bool "Elevator Fair Queuing Support"
+ default n
+ ---help---
+ Traditionally only cfq had a notion of multiple queues and did
+ fair queuing on its own. With cgroups and the need to control
+ IO, now even the simple io schedulers like noop, deadline and as
+ will have one queue per cgroup and will need hierarchical fair
+ queuing. Instead of every io scheduler implementing its own fair
+ queuing logic, this option enables fair queuing in the elevator
+ layer so that other ioschedulers can make use of it.
+ If unsure, say N.
+
config IOSCHED_NOOP
bool
default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING) += elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
.store = queue_iostats_store,
};
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+ .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_idle_show,
+ .store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+ .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_sync_show,
+ .store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+ .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_async_show,
+ .store = elv_slice_async_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ &queue_slice_idle_entry.attr,
+ &queue_slice_sync_entry.attr,
+ &queue_slice_async_entry.attr,
+#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE (5)
+#define ELV_HW_QUEUE_MIN (5)
+#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
+ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+ unsigned short prio)
+{
+ const int base_slice = efqd->elv_slice[sync];
+
+ WARN_ON(prio >= IOPRIO_BE_NR);
+
+ return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
+/* Mainly the BFQ scheduling code Follows */
+
+/*
+ * Shift for timestamp calculations. This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT 22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+ return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+ bfq_weight_t weight)
+{
+ bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+ do_div(d, weight);
+ return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+ bfq_service_t service)
+{
+ BUG_ON(entity->weight == 0);
+
+ entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data == NULL)
+ ioq = container_of(entity, struct io_queue, entity);
+ return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity. This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+ struct io_entity *entity = NULL;
+
+ if (node != NULL)
+ entity = rb_entry(node, struct io_entity, rb_node);
+
+ return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+ BUG_ON(entity->tree != root);
+
+ entity->tree = NULL;
+ rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *next;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ BUG_ON(entity->tree != &st->idle);
+
+ if (entity == st->first_idle) {
+ next = rb_next(&entity->rb_node);
+ st->first_idle = bfq_entity_of(next);
+ }
+
+ if (entity == st->last_idle) {
+ next = rb_prev(&entity->rb_node);
+ st->last_idle = bfq_entity_of(next);
+ }
+
+ bfq_extract(&st->idle, entity);
+
+ /* Delete queue from idle list */
+ if (ioq)
+ list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+ struct io_entity *entry;
+ struct rb_node **node = &root->rb_node;
+ struct rb_node *parent = NULL;
+
+ BUG_ON(entity->tree != NULL);
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb_entry(parent, struct io_entity, rb_node);
+
+ if (bfq_gt(entry->finish, entity->finish))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, root);
+
+ entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree. The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+ struct rb_node *node)
+{
+ struct io_entity *child;
+
+ if (node != NULL) {
+ child = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entity->min_start, child->min_start))
+ entity->min_start = child->min_start;
+ }
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value. The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+ struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+ entity->min_start = entity->start;
+ bfq_update_min(entity, node->rb_right);
+ bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update. This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root. The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+ struct rb_node *parent;
+
+up:
+ bfq_update_active_node(node);
+
+ parent = rb_parent(node);
+ if (parent == NULL)
+ return;
+
+ if (node == parent->rb_left && parent->rb_right != NULL)
+ bfq_update_active_node(parent->rb_right);
+ else if (parent->rb_left != NULL)
+ bfq_update_active_node(parent->rb_left);
+
+ node = parent;
+ goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node = &entity->rb_node;
+
+ bfq_insert(&st->active, entity);
+
+ if (node->rb_left != NULL)
+ node = node->rb_left;
+ else if (node->rb_right != NULL)
+ node = node->rb_right;
+
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (ioq)
+ elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch. If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+ struct rb_node *deepest;
+
+ if (node->rb_right == NULL && node->rb_left == NULL)
+ deepest = rb_parent(node);
+ else if (node->rb_right == NULL)
+ deepest = node->rb_left;
+ else if (node->rb_left == NULL)
+ deepest = node->rb_right;
+ else {
+ deepest = rb_next(node);
+ if (deepest->rb_right != NULL)
+ deepest = deepest->rb_right;
+ else if (rb_parent(deepest) != node)
+ deepest = rb_parent(deepest);
+ }
+
+ return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node;
+
+ node = bfq_find_deepest(&entity->rb_node);
+ bfq_extract(&st->active, entity);
+
+ if (node != NULL)
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+ st->first_idle = entity;
+ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+ st->last_idle = entity;
+
+ bfq_insert(&st->idle, entity);
+
+ /* Add this queue to idle list */
+ if (ioq)
+ list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue. Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(!entity->on_st);
+ entity->on_st = 0;
+ st->wsum -= entity->weight;
+ ioq = io_entity_to_ioq(entity);
+ if (!ioq)
+ return;
+ elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ bfq_idle_extract(st, entity);
+ bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+
+ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+ !bfq_gt(last_idle->finish, st->vtime)) {
+ /*
+ * Active tree is empty. Pull back vtime to finish time of
+ * last idle entity on idle tree.
+ * Rationale seems to be that it reduces the possibility of
+ * vtime wraparound (bfq_gt(V-F) < 0).
+ */
+ st->vtime = last_idle->finish;
+ }
+
+ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+ bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+ struct io_entity *entity)
+{
+ struct io_service_tree *new_st = old_st;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (entity->ioprio_changed) {
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->ioprio_changed = 0;
+
+ /*
+ * Also update the scaled budget for ioq. Group will get the
+ * updated budget once ioq is selected to run next.
+ */
+ if (ioq) {
+ struct elv_fq_data *efqd = ioq->efqd;
+ entity->budget = elv_prio_to_slice(efqd, ioq);
+ }
+
+ old_st->wsum -= entity->weight;
+ entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+ /*
+ * NOTE: here we may be changing the weight too early,
+ * this will cause unfairness. The correct approach
+ * would have required additional complexity to defer
+ * weight changes to the proper time instants (i.e.,
+ * when entity->finish <= old_st->vtime).
+ */
+ new_st = io_entity_service_tree(entity);
+ new_st->wsum += entity->weight;
+
+ if (new_st != old_st)
+ entity->start = new_st->vtime;
+ }
+
+ return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion. It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ if (entity == sd->active_entity) {
+ BUG_ON(entity->tree != NULL);
+ /*
+ * If we are requeueing the current entity we have
+ * to take care of not charging to it service it has
+ * not received.
+ */
+ bfq_calc_finish(entity, entity->service);
+ entity->start = entity->finish;
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active) {
+ /*
+ * Requeueing an entity due to a change of some
+ * next_active entity below it. We reuse the old
+ * start time.
+ */
+ bfq_active_extract(st, entity);
+ } else if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq_idle_extract() will
+ * check for that.
+ */
+ bfq_idle_extract(st, entity);
+ entity->start = bfq_gt(st->vtime, entity->finish) ?
+ st->vtime : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = st->vtime;
+ st->wsum += entity->weight;
+ bfq_get_entity(entity);
+
+ BUG_ON(entity->on_st);
+ entity->on_st = 1;
+ }
+
+ st = __bfq_entity_update_prio(st, entity);
+ /*
+ * This is to emulate cfq-like functionality where preemption can
+ * happen within the same class, like a sync queue preempting an async
+ * queue. Maybe this is not a very good idea from a fairness point of
+ * view, as the preempting queue gains share. Keeping it for now.
+ */
+ if (add_front) {
+ struct io_entity *next_entity;
+
+ /*
+ * Determine the entity which will be dispatched next
+ * Use sd->next_active once hierarchical patch is applied
+ */
+ next_entity = bfq_lookup_next_entity(sd, 0);
+
+ if (next_entity && next_entity != entity) {
+ struct io_service_tree *new_st;
+ bfq_timestamp_t delta;
+
+ new_st = io_entity_service_tree(next_entity);
+
+ /*
+ * At this point, both entities should belong to
+ * same service tree as cross service tree preemption
+ * is automatically taken care by algorithm
+ */
+ BUG_ON(new_st != st);
+ entity->finish = next_entity->finish - 1;
+ delta = bfq_delta(entity->budget, entity->weight);
+ entity->start = entity->finish - delta;
+ if (bfq_gt(entity->start, st->vtime))
+ entity->start = st->vtime;
+ }
+ } else {
+ bfq_calc_finish(entity, entity->budget);
+ }
+ bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ __bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state. If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+ int was_active = entity == sd->active_entity;
+ int ret = 0;
+
+ if (!entity->on_st)
+ return 0;
+
+ BUG_ON(was_active && entity->tree != NULL);
+
+ if (was_active) {
+ bfq_calc_finish(entity, entity->service);
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active)
+ bfq_active_extract(st, entity);
+ else if (entity->tree == &st->idle)
+ bfq_idle_extract(st, entity);
+ else if (entity->tree != NULL)
+ BUG();
+
+ if (!requeue || !bfq_gt(entity->finish, st->vtime))
+ bfq_forget_entity(st, entity);
+ else
+ bfq_idle_insert(st, entity);
+
+ BUG_ON(sd->active_entity == entity);
+
+ return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ __bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time. Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+ struct io_entity *entry;
+ struct rb_node *node = st->active.rb_node;
+
+ entry = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entry->min_start, st->vtime)) {
+ st->vtime = entry->min_start;
+ bfq_forget_idle(st);
+ }
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and going left whenever the left subtree contains
+ * at least one eligible (start <= vtime) entity. The path on the right is
+ * followed only if a) the left subtree contains no eligible entities and
+ * b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+ struct io_entity *entry, *first = NULL;
+ struct rb_node *node = st->active.rb_node;
+
+ while (node != NULL) {
+ entry = rb_entry(node, struct io_entity, rb_node);
+left:
+ if (!bfq_gt(entry->start, st->vtime))
+ first = entry;
+
+ BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+ if (node->rb_left != NULL) {
+ entry = rb_entry(node->rb_left,
+ struct io_entity, rb_node);
+ if (!bfq_gt(entry->min_start, st->vtime)) {
+ node = node->rb_left;
+ goto left;
+ }
+ }
+ if (first != NULL)
+ break;
+ node = node->rb_right;
+ }
+
+ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+ return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+ struct io_entity *entity;
+
+ if (RB_EMPTY_ROOT(&st->active))
+ return NULL;
+
+ bfq_update_vtime(st);
+ entity = bfq_first_active_entity(st);
+ BUG_ON(bfq_gt(entity->start, st->vtime));
+
+ return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true, the returned entity will also be extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup could be decreased with no
+ * effort by simply returning the cached next_active value; we prefer
+ * to do full lookups to test the consistency of the data structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract)
+{
+ struct io_service_tree *st = sd->service_tree;
+ struct io_entity *entity;
+ int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+ BUG_ON(extract && sd->active_entity != NULL);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ entity = __bfq_lookup_next_entity(st);
+ if (entity != NULL) {
+ if (extract) {
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ }
+ break;
+ }
+ }
+
+ return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+ struct io_service_tree *st;
+
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+}
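
For illustration (editorial sketch, not part of the patch): entity_served() advances the tree's virtual time by served/wsum, which is what makes the sharing proportional; the heavier the total weight on the tree, the slower vtime advances for a given amount of service. A tiny userspace demonstration of the arithmetic, without the fixed-point scaling bfq_delta() uses:

#include <stdio.h>

int main(void)
{
	unsigned long long vtime = 0;
	unsigned long wsum = 3;		/* e.g. two backlogged entities, weights 2 and 1 */
	unsigned long served = 6;	/* service received in one round */

	/* mirrors st->vtime += bfq_delta(served, st->wsum) */
	vtime += served / wsum;
	printf("vtime advanced by %lu to %llu\n", served / wsum, vtime);
	return 0;
}
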
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+ return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+ return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+ return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+ return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+ return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+ entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_idle_slice value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = jiffies_to_msecs(efqd->elv_slice_idle);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ data = msecs_to_jiffies(data);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice_idle = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[1];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[1] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[0];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[0] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (elv_nr_busy_ioq(q->elevator)) {
+ elv_log(efqd, "schedule dispatch");
+ kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+ }
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+ struct elv_fq_data *efqd =
+ container_of(work, struct elv_fq_data, unplug_work);
+ struct request_queue *q = efqd->queue;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_start_queueing(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+ del_timer_sync(&e->efqd.idle_slice_timer);
+ cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ ioq->slice_end = jiffies + ioq->entity.budget;
+ elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ unsigned long elapsed = jiffies - ioq->last_end_request;
+ unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+ ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+ ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+ ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
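
The update above keeps a decaying (roughly 7/8) average of the gap between a queue's request completions, the same scheme CFQ uses; ttime_samples saturates towards 256 and ttime_mean is the rounded per-sample average. A small standalone illustration of how the numbers evolve (editorial sketch only):

#include <stdio.h>

int main(void)
{
	unsigned long samples = 0, total = 0, mean = 0;
	unsigned long gaps[] = { 4, 8, 2, 16 };	/* hypothetical think times, in jiffies */
	unsigned int i;

	for (i = 0; i < sizeof(gaps) / sizeof(gaps[0]); i++) {
		samples = (7 * samples + 256) / 8;	/* converges towards 256 */
		total = (7 * total + 256 * gaps[i]) / 8;
		mean = (total + 128) / samples;		/* rounded mean think time */
		printf("after sample %u: mean=%lu\n", i + 1, mean);
	}
	return 0;
}
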
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+ struct io_queue *ioq, struct request *rq)
+{
+ int old_idle, enable_idle;
+ struct elv_fq_data *efqd = ioq->efqd;
+
+ /*
+ * Don't idle for async or idle io prio class
+ */
+ if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+ return;
+
+ enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+ if (!efqd->elv_slice_idle)
+ enable_idle = 0;
+ else if (ioq_sample_valid(ioq->ttime_samples)) {
+ if (ioq->ttime_mean > efqd->elv_slice_idle)
+ enable_idle = 0;
+ else
+ enable_idle = 1;
+ }
+
+	/*
+	 * From a think time perspective, idling should be enabled. Check
+	 * with the io scheduler if it wants to disable idling based on
+	 * additional considerations such as the seek pattern.
+	 */
+ if (enable_idle) {
+ if (eq->ops->elevator_update_idle_window_fn)
+ enable_idle = eq->ops->elevator_update_idle_window_fn(
+ eq, ioq->sched_queue, rq);
+ if (!enable_idle)
+ elv_log_ioq(efqd, ioq, "iosched disabled idle");
+ }
+
+ if (old_idle != enable_idle) {
+ elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+ if (enable_idle)
+ elv_mark_ioq_idle_window(ioq);
+ else
+ elv_clear_ioq_idle_window(ioq);
+ }
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+ struct io_queue *ioq = NULL;
+
+ ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+ return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+ kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio,
+ int is_sync)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+ struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+ RB_CLEAR_NODE(&ioq->entity.rb_node);
+ atomic_set(&ioq->ref, 0);
+ ioq->efqd = efqd;
+ elv_ioq_set_ioprio_class(ioq, ioprio_class);
+ elv_ioq_set_ioprio(ioq, ioprio);
+ ioq->pid = current->pid;
+ ioq->sched_queue = sched_queue;
+ if (is_sync && !elv_ioq_class_idle(ioq))
+ elv_mark_ioq_idle_window(ioq);
+ bfq_init_entity(&ioq->entity, iog);
+ ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+ return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+ efqd);
+
+ BUG_ON(atomic_read(&ioq->ref) <= 0);
+ if (!atomic_dec_and_test(&ioq->ref))
+ return;
+ BUG_ON(ioq->nr_queued);
+ BUG_ON(ioq->entity.tree != NULL);
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(efqd->active_queue == ioq);
+
+ /* Can be called by outgoing elevator. Don't use q */
+ BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+ e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+ elv_log_ioq(efqd, ioq, "put_queue");
+ elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+ struct io_queue *ioq = *ioq_ptr;
+
+ if (ioq != NULL) {
+ /* Drop the reference taken by the io group */
+ elv_put_ioq(ioq);
+ *ioq_ptr = NULL;
+ }
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next,
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level: to begin with, the close
+ * cooperator feature is supported only for the root group, so that default
+ * cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ struct io_sched_data *sd = &efqd->root_group->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+ BUG_ON(!efqd->busy_queues);
+ BUG_ON(sd != entity->sched_data);
+ BUG_ON(!st);
+
+ bfq_update_vtime(st);
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ entity->service = 0;
+ elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = NULL;
+ struct io_queue *ioq = NULL;
+ struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while another
+	 * queue is still active; the preempt logic uses this.
+	 */
+ BUG_ON(extract && efqd->active_queue != NULL);
+
+ if (!efqd->busy_queues)
+ return NULL;
+
+ sd = &efqd->root_group->sched_data;
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ BUG_ON(!entity);
+ if (extract)
+ entity->service = 0;
+ ioq = io_entity_to_ioq(entity);
+
+ return ioq;
+}
+
+/*
+ * coop indicates that the io scheduler selected this queue for us and we
+ * did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int coop)
+{
+ struct request_queue *q = efqd->queue;
+
+ if (ioq) {
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+ efqd->busy_queues);
+ ioq->slice_end = 0;
+
+ elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_mark_ioq_slice_new(ioq);
+
+ del_timer(&efqd->idle_slice_timer);
+ }
+
+ efqd->active_queue = ioq;
+
+ /* Let iosched know if it wants to take some action */
+ if (ioq) {
+ if (q->elevator->ops->elevator_active_ioq_set_fn)
+ q->elevator->ops->elevator_active_ioq_set_fn(q,
+ ioq->sched_queue, coop);
+ }
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ int coop = 0;
+
+ if (!ioq)
+ ioq = elv_get_next_ioq(q, 1);
+ else {
+ elv_set_next_ioq(q, ioq);
+		/*
+		 * The io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler. cfq currently uses it to
+		 * reset the coop flag on the queue.
+		 */
+ coop = 1;
+ }
+ __elv_set_active_ioq(efqd, ioq, coop);
+ return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+ struct request_queue *q = efqd->queue;
+ struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+ if (q->elevator->ops->elevator_active_ioq_reset_fn)
+ q->elevator->ops->elevator_active_ioq_reset_fn(q,
+ ioq->sched_queue);
+ efqd->active_queue = NULL;
+ del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+ bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue)
+{
+ if (ioq == efqd->active_queue)
+ elv_reset_active_ioq(efqd);
+
+ bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(ioq == efqd->active_queue);
+ elv_log_ioq(efqd, ioq, "add to busy");
+ elv_activate_ioq(ioq, 0);
+ elv_mark_ioq_busy(ioq);
+ efqd->busy_queues++;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ BUG_ON(!elv_ioq_busy(ioq));
+ BUG_ON(ioq->nr_queued);
+ elv_log_ioq(efqd, ioq, "del from busy");
+ elv_clear_ioq_busy(ioq);
+ BUG_ON(efqd->busy_queues == 0);
+ efqd->busy_queues--;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues--;
+
+ elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used, and adjust the start and finish times of the
+ * queue and the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the
+ * same time, it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from the queue has completed. This does not
+ * work very well if we expire the queue before the first request has
+ * finished. For seeky queues, we will expire the queue after dispatching
+ * a few requests without waiting, and start dispatching from the next
+ * queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the
+ * time slice for such cases. A better mechanism is needed for accurate
+ * accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+ assert_spin_locked(q->queue_lock);
+ elv_log_ioq(efqd, ioq, "slice expired");
+
+ if (elv_ioq_wait_request(ioq))
+ del_timer(&efqd->idle_slice_timer);
+
+ elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before the first
+	 * request from the queue completed. Of course we are not planning
+	 * to idle on the queue, otherwise we would not have expired it.
+	 *
+	 * Charge 25% of the slice in such cases. This is not the best
+	 * thing to do, but it is not clear what the next best thing
+	 * would be.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time: the io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the exact disk time
+	 * used by each queue, one should dispatch requests from only one
+	 * queue and wait for all the requests to finish. But this would
+	 * reduce throughput.
+	 */
+ if (!ioq->slice_end)
+ slice_used = entity->budget/4;
+ else {
+ if (time_after(ioq->slice_end, jiffies)) {
+ slice_unused = ioq->slice_end - jiffies;
+ if (slice_unused == entity->budget) {
+ /*
+ * queue got expired immediately after
+ * completing first request. Charge 25% of
+ * slice.
+ */
+ slice_used = entity->budget/4;
+ } else
+ slice_used = entity->budget - slice_unused;
+ } else {
+ slice_overshoot = jiffies - ioq->slice_end;
+ slice_used = entity->budget + slice_overshoot;
+ }
+ }
+
+ elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+ jiffies);
+ elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+ slice_used, entity->budget, slice_overshoot);
+ elv_ioq_served(ioq, slice_used);
+
+ BUG_ON(ioq != efqd->active_queue);
+ elv_reset_active_ioq(efqd);
+
+ if (!ioq->nr_queued)
+ elv_del_ioq_busy(q->elevator, ioq, 1);
+ else
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
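
To make the charging policy above easier to follow, here is an editorial sketch (not part of the patch) that condenses the three cases of __elv_ioq_slice_expired() into one helper: no completion before expiry, expiry with slice time left, and expiry after overshooting the slice:

/*
 * Editorial illustration of the slice charging decision. budget is the
 * allocated slice; slice_end and now are absolute times; slice_end == 0
 * means the queue expired before its first request completed.
 */
static long charge_for_slice(long budget, long slice_end, long now)
{
	long unused;

	if (!slice_end)
		return budget / 4;		/* crude 25% charge */

	if (now < slice_end) {
		unused = slice_end - now;
		if (unused == budget)		/* expired right after first completion */
			return budget / 4;
		return budget - unused;		/* charge the part actually used */
	}

	return budget + (now - slice_end);	/* charge the overshoot as well */
}
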
+
+/*
+ * Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (ioq)
+ __elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no (or if we aren't sure); returning 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+ struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elevator_queue *eq = q->elevator;
+
+ ioq = elv_active_ioq(eq);
+
+ if (!ioq)
+ return 0;
+
+ if (elv_ioq_slice_used(ioq))
+ return 1;
+
+ if (elv_ioq_class_idle(new_ioq))
+ return 0;
+
+ if (elv_ioq_class_idle(ioq))
+ return 1;
+
+ /*
+ * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ */
+ if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+ return 1;
+
+ /*
+ * Check with io scheduler if it has additional criterion based on
+ * which it wants to preempt existing queue.
+ */
+ if (eq->ops->elevator_should_preempt_fn)
+ return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+ return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+ elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+ elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ BUG_ON(!efqd);
+ BUG_ON(!ioq);
+ efqd->rq_queued++;
+ ioq->nr_queued++;
+
+ if (!elv_ioq_busy(ioq))
+ elv_add_ioq_busy(efqd, ioq);
+
+ elv_ioq_update_io_thinktime(ioq);
+ elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+ if (ioq == elv_active_ioq(q->elevator)) {
+ /*
+ * Remember that we saw a request from this process, but
+ * don't start queuing just yet. Otherwise we risk seeing lots
+ * of tiny requests, because we disrupt the normal plugging
+ * and merging. If the request is already larger than a single
+ * page, let it rip immediately. For that case we assume that
+ * merging is already done. Ditto for a busy system that
+ * has other work pending, don't risk delaying until the
+ * idle timer unplug to continue working.
+ */
+ if (elv_ioq_wait_request(ioq)) {
+ if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+ efqd->busy_queues > 1) {
+ del_timer(&efqd->idle_slice_timer);
+ blk_start_queueing(q);
+ }
+ elv_mark_ioq_must_dispatch(ioq);
+ }
+ } else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * Not the active queue - expire the current slice if it is
+		 * idle and has exceeded its mean think time, or this new
+		 * queue has some old slice time left and is of higher
+		 * priority, or this new queue is RT and the current one is BE.
+		 */
+ elv_preempt_queue(q, ioq);
+ blk_start_queueing(q);
+ }
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+ struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+ struct io_queue *ioq;
+ unsigned long flags;
+ struct request_queue *q = efqd->queue;
+
+ elv_log(efqd, "idle timer fired");
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ ioq = efqd->active_queue;
+
+ if (ioq) {
+
+ /*
+ * We saw a request before the queue expired, let it through
+ */
+ if (elv_ioq_must_dispatch(ioq))
+ goto out_kick;
+
+ /*
+ * expired
+ */
+ if (elv_ioq_slice_used(ioq))
+ goto expire;
+
+ /*
+ * only expire and reinvoke request handler, if there are
+ * other queues with pending requests
+ */
+ if (!elv_nr_busy_ioq(q->elevator))
+ goto out_cont;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (ioq->nr_queued)
+ goto out_kick;
+ }
+expire:
+ elv_ioq_slice_expired(q);
+out_kick:
+ elv_schedule_dispatch(q);
+out_cont:
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+ unsigned long sl;
+
+ BUG_ON(!ioq);
+
+ /*
+ * SSD device without seek penalty, disable idling. But only do so
+ * for devices that support queuing, otherwise we still have a problem
+ * with sync vs async workloads.
+ */
+ if (blk_queue_nonrot(q) && efqd->hw_tag)
+ return;
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver)
+ return;
+
+ /*
+ * idle is disabled, either manually or by past process history
+ */
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ return;
+
+	/*
+	 * The iosched may have its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+ if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ q->elevator->ops->elevator_arm_slice_timer_fn(q,
+ ioq->sched_queue);
+ } else {
+ elv_mark_ioq_wait_request(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu", sl);
+ }
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+ struct io_queue *ioq, *n;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+ elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+ if (!elv_nr_busy_ioq(q->elevator))
+ return NULL;
+
+ if (ioq == NULL)
+ goto new_queue;
+
+ /*
+ * Force dispatch. Continue to dispatch from current queue as long
+ * as it has requests.
+ */
+ if (unlikely(force)) {
+ if (ioq->nr_queued)
+ goto keep_queue;
+ else
+ goto expire;
+ }
+
+ /*
+ * The active queue has run out of time, expire it and select new.
+ */
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+ goto expire;
+
+ /*
+ * If we have a RT cfqq waiting, then we pre-empt the current non-rt
+ * cfqq.
+ */
+ if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as if the queue timed out, so that it
+		 * gets to bank the remainder of its time slice.
+		 */
+ elv_log_ioq(efqd, ioq, "preempt");
+ goto expire;
+ }
+
+ /*
+ * The active queue has requests and isn't expired, allow it to
+ * dispatch.
+ */
+
+ if (ioq->nr_queued)
+ goto keep_queue;
+
+ /*
+ * If another queue has a request waiting within our mean seek
+ * distance, let it run. The expire code will check for close
+ * cooperators and put the close queue at the front of the service
+ * tree.
+ */
+ new_ioq = elv_close_cooperator(q, ioq, 0);
+ if (new_ioq)
+ goto expire;
+
+ /*
+ * No requests pending. If the active queue still has requests in
+ * flight or is idling for a new request, allow either of these
+ * conditions to happen (or time out) before selecting a new queue.
+ */
+
+ if (timer_pending(&efqd->idle_slice_timer) ||
+ (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
+expire:
+ elv_ioq_slice_expired(q);
+new_queue:
+ ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+ return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elv_fq_data *efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ ioq = rq->ioq;
+ BUG_ON(!ioq);
+ ioq->nr_queued--;
+
+ efqd = ioq->efqd;
+ BUG_ON(!efqd);
+ efqd->rq_queued--;
+
+ if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+ elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ BUG_ON(!ioq);
+ elv_ioq_request_dispatched(ioq);
+ elv_ioq_request_removed(e, rq);
+ elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ efqd->rq_in_driver++;
+ elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ WARN_ON(!efqd->rq_in_driver);
+ efqd->rq_in_driver--;
+ elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+ if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+ efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+ if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+ efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+ return;
+
+ if (efqd->hw_tag_samples++ < 50)
+ return;
+
+ if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+ efqd->hw_tag = 1;
+ else
+ efqd->hw_tag = 0;
+
+ efqd->hw_tag_samples = 0;
+ efqd->rq_in_driver_peak = 0;
+}
+
+/*
+ * If ioscheduler has functionality of keeping track of close cooperator, check
+ * with it if it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *new_ioq = NULL;
+
+ /*
+ * Currently this feature is supported only for flat hierarchy or
+ * root group queues so that default cfq behavior is not changed.
+ */
+ if (!is_root_group_ioq(q, ioq))
+ return NULL;
+
+ if (q->elevator->ops->elevator_close_cooperator_fn)
+ new_ioq = e->ops->elevator_close_cooperator_fn(q,
+ ioq->sched_queue, probe);
+
+ /* Only select co-operating queue if it belongs to root group */
+ if (new_ioq && !is_root_group_ioq(q, new_ioq))
+ return NULL;
+
+ return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+ const int sync = rq_is_sync(rq);
+ struct io_queue *ioq = rq->ioq;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ elv_log_ioq(efqd, ioq, "complete");
+
+ elv_update_hw_tag(efqd);
+
+ WARN_ON(!efqd->rq_in_driver);
+ WARN_ON(!ioq->dispatched);
+ efqd->rq_in_driver--;
+ ioq->dispatched--;
+
+ if (sync)
+ ioq->last_end_request = jiffies;
+
+ /*
+ * If this is the active queue, check if it needs to be expired,
+ * or if we want to idle in case it has no pending requests.
+ */
+
+ if (elv_active_ioq(q->elevator) == ioq) {
+ if (elv_ioq_slice_new(ioq)) {
+ elv_ioq_set_prio_slice(q, ioq);
+ elv_clear_ioq_slice_new(ioq);
+ }
+ /*
+ * If there are no requests waiting in this queue, and
+ * there are other queues ready to issue requests, AND
+ * those other queues are issuing requests within our
+ * mean seek distance, give them a chance to run instead
+ * of idling.
+ */
+ if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ elv_ioq_slice_expired(q);
+ else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ && sync && !rq_noidle(rq))
+ elv_ioq_arm_slice_timer(q);
+ }
+
+ if (!efqd->rq_in_driver)
+ elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio)
+{
+ struct io_queue *ioq = NULL;
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ ioq = iog->async_queue[0][ioprio];
+ break;
+ case IOPRIO_CLASS_BE:
+ ioq = iog->async_queue[1][ioprio];
+ break;
+ case IOPRIO_CLASS_IDLE:
+ ioq = iog->async_idle_queue;
+ break;
+ default:
+ BUG();
+ }
+
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq)
+{
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ iog->async_queue[0][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_BE:
+ iog->async_queue[1][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ iog->async_idle_queue = ioq;
+ break;
+ default:
+ BUG();
+ }
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up.
+	 */
+ elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+ /*
+ * Caller already ensured that pending RCU callbacks are completed,
+ * so we should have no busy allocations at this point.
+ */
+ if (elv_ioq_pool)
+ kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+ elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+ if (!elv_ioq_pool)
+ goto fail;
+
+ return 0;
+fail:
+ elv_slab_kill();
+ return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ iog = io_alloc_root_group(q, e, efqd);
+ if (iog == NULL)
+ return 1;
+
+ efqd->root_group = iog;
+ efqd->queue = q;
+
+ init_timer(&efqd->idle_slice_timer);
+ efqd->idle_slice_timer.function = elv_idle_slice_timer;
+ efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+ INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+ INIT_LIST_HEAD(&efqd->idle_list);
+
+ efqd->elv_slice[0] = elv_slice_async;
+ efqd->elv_slice[1] = elv_slice_sync;
+ efqd->elv_slice_idle = elv_slice_idle;
+ efqd->hw_tag = 1;
+
+ return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later,
+ * during elevator cleanup, the ioc reference will be dropped, which will
+ * lead to removal of the ioscheduler queue as well as the associated ioq
+ * object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+ struct request_queue *q = efqd->queue;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+
+ spin_lock_irq(q->queue_lock);
+ /* This should drop all the idle tree references of ioq */
+ elv_free_idle_ioq_list(e);
+ spin_unlock_irq(q->queue_lock);
+
+ elv_shutdown_timer_wq(e);
+
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+ io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think this function is required. Right now it is kept only
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. By then the io scheduler has already been drained and all
+ * the active queues have already been expired, so the timer and work queue
+ * should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+ if (elv_slab_setup())
+ return -ENOMEM;
+
+ /* could be 0 on HZ < 1000 setups */
+
+ if (!elv_slice_async)
+ elv_slice_async = 1;
+
+ if (!elv_slice_idle)
+ elv_slice_idle = 1;
+
+ return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES 3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree. All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+ struct rb_root active;
+ struct rb_root idle;
+
+ struct io_entity *first_idle;
+ struct io_entity *last_idle;
+
+ bfq_timestamp_t vtime;
+ bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue. It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority classes are served before all the
+ * requests from lower priority classes; among queues of the same
+ * class, requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_sched_data {
+ struct io_entity *active_entity;
+ struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ * the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ * this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ * associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ * ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ * ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy. Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now. Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag. As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ. When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_entity {
+ struct rb_node rb_node;
+
+ int on_st;
+
+ bfq_timestamp_t finish;
+ bfq_timestamp_t start;
+
+ struct rb_root *tree;
+
+ bfq_timestamp_t min_start;
+
+ bfq_service_t service, budget;
+ bfq_weight_t weight;
+
+ struct io_entity *parent;
+
+ struct io_sched_data *my_sched_data;
+ struct io_sched_data *sched_data;
+
+ unsigned short ioprio, new_ioprio;
+ unsigned short ioprio_class, new_ioprio_class;
+
+ int ioprio_changed;
+};
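
As a worked example of the timestamp documentation above (editorial, not part of the patch): with weight = IOPRIO_BE_NR - ioprio and IOPRIO_BE_NR being 8, an ioprio 0 queue gets weight 8 and an ioprio 7 queue gets weight 1, so the same budget pushes the finish time of the heavier queue forward eight times more slowly:

#include <stdio.h>

#define IOPRIO_BE_NR 8	/* number of best-effort priority levels */

int main(void)
{
	unsigned long long start = 1000, budget = 800;
	int ioprio;

	/* F_i = S_i + budget / weight, with weight = IOPRIO_BE_NR - ioprio */
	for (ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++) {
		unsigned long weight = IOPRIO_BE_NR - ioprio;
		printf("ioprio %d: weight %lu, finish %llu\n",
		       ioprio, weight, start + budget / weight);
	}
	return 0;
}
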
+
+/*
+ * A common structure embedded by every io scheduler into their respective
+ * queue structure.
+ */
+struct io_queue {
+ struct io_entity entity;
+ atomic_t ref;
+ unsigned int flags;
+
+ /* Pointer to generic elevator data structure */
+ struct elv_fq_data *efqd;
+ struct list_head queue_list;
+ pid_t pid;
+
+ /* Number of requests queued on this io queue */
+ unsigned long nr_queued;
+
+ /* Requests dispatched from this queue */
+ int dispatched;
+
+ /* Keep a track of think time of processes in this queue */
+ unsigned long last_end_request;
+ unsigned long ttime_total;
+ unsigned long ttime_samples;
+ unsigned long ttime_mean;
+
+ unsigned long slice_end;
+
+ /* Pointer to io scheduler's queue */
+ void *sched_queue;
+};
+
+struct io_group {
+ struct io_sched_data sched_data;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+ struct io_group *root_group;
+
+ /* List of io queues on idle tree. */
+ struct list_head idle_list;
+
+ struct request_queue *queue;
+ unsigned int busy_queues;
+ /*
+ * Used to track any pending rt requests so we can pre-empt current
+ * non-RT cfqq in service when this value is non-zero.
+ */
+ unsigned int busy_rt_queues;
+
+ /* Number of requests queued */
+ int rq_queued;
+
+ /* Pointer to the ioscheduler queue being served */
+ void *active_queue;
+
+ int rq_in_driver;
+ int hw_tag;
+ int hw_tag_samples;
+ int rq_in_driver_peak;
+
+	/*
+	 * The elevator fair queuing layer has the capability to provide
+	 * idling to ensure fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness between two processes
+	 * doing synchronous reads in two different cgroups. noop and
+	 * deadline don't have any notion of anticipation/idling; as of
+	 * now, these are the users of this functionality.
+	 */
+ unsigned int elv_slice_idle;
+ struct timer_list idle_slice_timer;
+ struct work_struct unplug_work;
+
+ unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid, \
+ elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples) ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+ ELV_QUEUE_FLAG_busy = 0, /* has requests or is under service */
+ ELV_QUEUE_FLAG_sync, /* synchronous queue */
+ ELV_QUEUE_FLAG_idle_window, /* elevator slice idling enabled */
+ ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
+ ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
+ ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name) \
+static inline void elv_mark_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline int elv_ioq_##name(struct io_queue *ioq) \
+{ \
+ return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0; \
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
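
The ELV_IO_QUEUE_FLAG_FNS() macro above generates a mark/clear/test triplet per flag (e.g. elv_mark_ioq_busy(), elv_clear_ioq_busy(), elv_ioq_busy()). A hypothetical caller, shown only to illustrate the generated interface:

/* Editorial illustration of the generated flag helpers (hypothetical caller). */
static inline void example_flag_usage(struct io_queue *ioq)
{
	elv_mark_ioq_busy(ioq);			/* set ELV_QUEUE_FLAG_busy */

	/* e.g. async queues don't keep the idle window */
	if (elv_ioq_busy(ioq) && !elv_ioq_sync(ioq))
		elv_clear_ioq_idle_window(ioq);

	elv_clear_ioq_busy(ioq);		/* clear the bit again */
}
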
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+ struct io_sched_data *sched_data = entity->sched_data;
+ unsigned int idx = entity->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+ BUG_ON(sched_data == NULL);
+
+ return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+ ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+ if (elv_ioq_slice_new(ioq))
+ return 0;
+ if (time_before(jiffies, ioq->slice_end))
+ return 0;
+
+ return 1;
+}
+
+/* How many request are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+ return ioq->dispatched;
+}
+
+/* How many request are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+ return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+ return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+ return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+ return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+ atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+ unsigned long slice_end)
+{
+ ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+ int ioprio_class)
+{
+ ioq->entity.new_ioprio_class = ioprio_class;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+ ioq->entity.new_ioprio = ioprio;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+ return container_of(ioq->entity.sched_data, struct io_group,
+ sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+ struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+ unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+ struct elevator_queue *e)
+{
+ return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
for (i = 0; i < ELV_HASH_ENTRIES; i++)
INIT_HLIST_HEAD(&eq->hash[i]);
+ if (elv_init_fq_data(q, eq))
+ goto err;
+
return eq;
err:
kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
e->ops = NULL;
+ elv_exit_fq_data_post(e);
mutex_unlock(&e->sysfs_lock);
kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_activate_rq(q, rq);
+
if (e->ops->elevator_activate_req_fn)
e->ops->elevator_activate_req_fn(q, rq);
}
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_deactivate_rq(q, rq);
+
if (e->ops->elevator_deactivate_req_fn)
e->ops->elevator_deactivate_req_fn(q, rq);
}
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
boundary = q->end_sector;
stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
q->end_sector = rq_end_sector(rq);
q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
elv_rqhash_del(q, next);
q->nr_sorted--;
+ elv_ioq_request_removed(e, next);
q->last_merge = rq;
}
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->last_merge = rq;
}
- /*
- * Some ioscheds (cfq) run q->request_fn directly, so
- * rq cannot be accessed after calling
- * elevator_add_req_fn.
- */
q->elevator->ops->elevator_add_req_fn(q, rq);
+ elv_ioq_request_add(q, rq);
break;
case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
int elv_queue_empty(struct request_queue *q)
{
- struct elevator_queue *e = q->elevator;
-
if (!list_empty(&q->queue_head))
return 0;
- if (e->ops->elevator_queue_empty_fn)
- return e->ops->elevator_queue_empty_fn(q);
+ /* Hopefully nr_sorted works and no need to call queue_empty_fn */
+ if (q->nr_sorted)
+ return 0;
return 1;
}
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
*/
if (blk_account_rq(rq)) {
q->in_flight--;
- if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
- e->ops->elevator_completed_req_fn(q, rq);
+ if (blk_sorted_rq(rq)) {
+ if (e->ops->elevator_completed_req_fn)
+ e->ops->elevator_completed_req_fn(q, rq);
+ elv_ioq_completed_request(q, rq);
+ }
}
/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
return NULL;
}
EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+ return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+ return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
/* for bidi */
struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* io queue request belongs to */
+ struct io_queue *ioq;
+#endif
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
#define _LINUX_ELEVATOR_H
#include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
#ifdef CONFIG_BLOCK
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+ struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+ struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+ void*, int probe);
+#endif
struct elevator_ops
{
@@ -56,6 +69,17 @@ struct elevator_ops
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+ elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+ elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+ elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+ elevator_should_preempt_fn *elevator_should_preempt_fn;
+ elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+ elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
};
#define ELV_NAME_MAX (16)
@@ -76,6 +100,9 @@ struct elevator_type
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ int elevator_features;
+#endif
};
/*
@@ -89,6 +116,10 @@ struct elevator_queue
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* fair queuing data */
+ struct elv_fq_data efqd;
+#endif
};
/*
@@ -209,5 +240,25 @@ enum {
__val; \
})
+/* an iosched can let the elevator know its feature set/capabilities */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define ELV_IOSCHED_NEED_FQ 1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 03/18] io-controller: Charge for time slice based on average disk rate
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (2 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (33 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o There are situations where a queue gets expired very soon and it looks
as if the time slice used by that queue is zero. For example, an async
queue dispatches a bunch of requests and the queue is expired before the
first request completes. Another example is a queue that is expired as
soon as the first request completes and has no more requests queued (sync
queues on SSD).
o Currently we just charge 25% of the slice length in such cases. This patch
tries to improve on that approximation by keeping track of the average disk
rate and charging for time as nr_sectors/disk_rate (a sketch of the
calculation follows below).
o This is still experimental; it is not yet clear whether it gives a
measurable improvement.
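A minimal userspace sketch of the charging idea, for illustration only; the
function names, plain integer types and main() below are made up, while the
actual patch keeps this state in the new efqd->rate_* fields shown in the
diff that follows:

#include <stdio.h>

static unsigned long mean_rate;		/* sectors per jiffy, assumed known */

/* Update the rate estimate at the end of a sampling window. */
static void update_mean_rate(unsigned long sectors, unsigned long elapsed)
{
	static unsigned long rate_sectors, rate_time;

	if (!elapsed)
		elapsed = 1;		/* window shorter than one jiffy */

	/* 7/8 weighted moving averages, as in elv_update_io_rate() */
	rate_sectors = (7 * rate_sectors + 256 * sectors) / 8;
	rate_time = (7 * rate_time + 256 * elapsed) / 8;
	mean_rate = (rate_sectors + rate_time / 2) / rate_time;
}

/* Charge the queue: sectors dispatched divided by the mean disk rate. */
static unsigned long disk_time_used(unsigned long nr_sectors, unsigned long budget)
{
	unsigned long jiffies_used;

	if (!mean_rate)
		return budget / 4;	/* fall back to the old 25% charge */

	jiffies_used = nr_sectors / mean_rate;
	return jiffies_used ? jiffies_used : 1;
}

int main(void)
{
	update_mean_rate(800, 10);	/* e.g. 800 sectors finished in 10 jiffies */
	printf("charge for 200 sectors: %lu jiffies\n", disk_time_used(200, 40));
	return 0;
}

rate_sectors and rate_time are scaled by the same 256/8 factor, so their
ratio still comes out in sectors per jiffy; adding rate_time/2 before the
division rounds to the nearest value.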
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 11 ++++++
2 files changed, 94 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9aea899..9f1fbb9 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -19,6 +19,9 @@ const int elv_slice_async_rq = 2;
int elv_slice_idle = HZ / 125;
static struct kmem_cache *elv_ioq_pool;
+/* Maximum Window length for updating average disk rate */
+static int elv_rate_sampling_window = HZ / 10;
+
#define ELV_SLICE_SCALE (5)
#define ELV_HW_QUEUE_MIN (5)
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
@@ -1022,6 +1025,47 @@ static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
}
+static void elv_update_io_rate(struct elv_fq_data *efqd, struct request *rq)
+{
+ long elapsed = jiffies - efqd->rate_sampling_start;
+ unsigned long total;
+
+ /* sampling window is off */
+ if (!efqd->rate_sampling_start)
+ return;
+
+ efqd->rate_sectors_current += rq->nr_sectors;
+
+ if (efqd->rq_in_driver && (elapsed < elv_rate_sampling_window))
+ return;
+
+ efqd->rate_sectors = (7*efqd->rate_sectors +
+ 256*efqd->rate_sectors_current) / 8;
+
+ if (!elapsed) {
+ /*
+ * updating rate before a jiffy could complete. Could be a
+ * problem with fast queuing/non-queuing hardware. Should we
+ * look at higher resolution time source?
+ *
+ * In case of non-queuing hardware we will probably not try to
+ * dispatch from multiple queues and will be able to account
+ * for disk time used and will not need this approximation
+ * anyway?
+ */
+ elapsed = 1;
+ }
+
+ efqd->rate_time = (7*efqd->rate_time + 256*elapsed) / 8;
+ total = efqd->rate_sectors + (efqd->rate_time/2);
+ efqd->mean_rate = total/efqd->rate_time;
+
+ elv_log(efqd, "mean_rate=%lu, t=%ld s=%lu", efqd->mean_rate,
+ elapsed, efqd->rate_sectors_current);
+ efqd->rate_sampling_start = 0;
+ efqd->rate_sectors_current = 0;
+}
+
/*
* Disable idle window if the process thinks too long.
* This idle flag can also be updated by io scheduler.
@@ -1312,6 +1356,34 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
}
/*
+ * Calculate the effective disk time used by the queue based on how many
+ * sectors the queue has dispatched and the average disk rate.
+ * Returns the disk time used, in jiffies.
+ */
+static inline unsigned long elv_disk_time_used(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ unsigned long jiffies_used = 0;
+
+ if (!efqd->mean_rate)
+ return entity->budget/4;
+
+ /* Charge the queue based on average disk rate */
+ jiffies_used = ioq->nr_sectors/efqd->mean_rate;
+
+ if (!jiffies_used)
+ jiffies_used = 1;
+
+ elv_log_ioq(efqd, ioq, "disk time=%ums sect=%d rate=%lu",
+ jiffies_to_msecs(jiffies_used),
+ ioq->nr_sectors, efqd->mean_rate);
+
+ return jiffies_used;
+}
+
+/*
* Do the accounting. Determine how much service (in terms of time slices)
* current queue used and adjust the start, finish time of queue and vtime
* of the tree accordingly.
@@ -1363,7 +1435,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
* the requests to finish. But this will reduce throughput.
*/
if (!ioq->slice_end)
- slice_used = entity->budget/4;
+ slice_used = elv_disk_time_used(q, ioq);
else {
if (time_after(ioq->slice_end, jiffies)) {
slice_unused = ioq->slice_end - jiffies;
@@ -1373,7 +1445,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
* completing first request. Charge 25% of
* slice.
*/
- slice_used = entity->budget/4;
+ slice_used = elv_disk_time_used(q, ioq);
} else
slice_used = entity->budget - slice_unused;
} else {
@@ -1391,6 +1463,8 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
BUG_ON(ioq != efqd->active_queue);
elv_reset_active_ioq(efqd);
+ /* Queue is being expired. Reset number of sectors dispatched */
+ ioq->nr_sectors = 0;
if (!ioq->nr_queued)
elv_del_ioq_busy(q->elevator, ioq, 1);
else
@@ -1725,6 +1799,7 @@ void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
BUG_ON(!ioq);
elv_ioq_request_dispatched(ioq);
+ ioq->nr_sectors += rq->nr_sectors;
elv_ioq_request_removed(e, rq);
elv_clear_ioq_must_dispatch(ioq);
}
@@ -1737,6 +1812,10 @@ void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
return;
efqd->rq_in_driver++;
+
+ if (!efqd->rate_sampling_start)
+ efqd->rate_sampling_start = jiffies;
+
elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
efqd->rq_in_driver);
}
@@ -1826,6 +1905,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
efqd->rq_in_driver--;
ioq->dispatched--;
+ elv_update_io_rate(efqd, rq);
+
if (sync)
ioq->last_end_request = jiffies;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 3bea279..ce2d671 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,6 +165,9 @@ struct io_queue {
/* Requests dispatched from this queue */
int dispatched;
+ /* Number of sectors dispatched in current dispatch round */
+ int nr_sectors;
+
/* Keep a track of think time of processes in this queue */
unsigned long last_end_request;
unsigned long ttime_total;
@@ -223,6 +226,14 @@ struct elv_fq_data {
struct work_struct unplug_work;
unsigned int elv_slice[2];
+
+ /* Fields for keeping track of average disk rate */
+ unsigned long rate_sectors; /* number of sectors finished */
+ unsigned long rate_time; /* jiffies elapsed */
+ unsigned long mean_rate; /* sectors per jiffy */
+ unsigned long long rate_sampling_start; /* sampling window start, jiffies */
+ /* number of sectors of IO finished during the current sampling window */
+ unsigned long rate_sectors_current;
};
extern int elv_slice_idle;
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (4 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
[not found] ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-22 8:54 ` Gui Jianfeng
2009-05-05 19:58 ` Vivek Goyal
` (31 subsequent siblings)
37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This patch changes cfq to use fair queuing code from elevator layer.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 3 +-
block/cfq-iosched.c | 1097 ++++++++++---------------------------------------
2 files changed, 219 insertions(+), 881 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
menu "IO Schedulers"
config ELV_FAIR_QUEUING
- bool "Elevator Fair Queuing Support"
+ bool
default n
---help---
Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select ELV_FAIR_QUEUING
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
-
/*
* tunables
*/
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
static const int cfq_back_max = 16 * 1024;
/* penalty of a backwards seek */
static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY (HZ / 5)
/*
* below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq) (struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
static struct kmem_cache *cfq_pool;
static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);
#define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
#define sample_valid(samples) ((samples) > 80)
@@ -75,12 +64,6 @@ struct cfq_rb_root {
*/
struct cfq_data {
struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_rb_root service_tree;
-
/*
* Each priority tree is sorted by next_request position. These
* trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
*/
struct rb_root prio_trees[CFQ_PRIO_LISTS];
- unsigned int busy_queues;
- /*
- * Used to track any pending rt requests so we can pre-empt current
- * non-RT cfqq in service when this value is non-zero.
- */
- unsigned int busy_rt_queues;
-
- int rq_in_driver;
int sync_flight;
- /*
- * queue-depth detection
- */
- int rq_queued;
- int hw_tag;
- int hw_tag_samples;
- int rq_in_driver_peak;
-
- /*
- * idle window management
- */
- struct timer_list idle_slice_timer;
- struct work_struct unplug_work;
-
- struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;
- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;
unsigned long last_end_request;
@@ -131,9 +85,7 @@ struct cfq_data {
unsigned int cfq_fifo_expire[2];
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
- unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
- unsigned int cfq_slice_idle;
struct list_head cic_list;
};
@@ -142,16 +94,11 @@ struct cfq_data {
* Per process-grouping structure
*/
struct cfq_queue {
- /* reference count */
- atomic_t ref;
+ struct io_queue *ioq;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;
- unsigned long slice_end;
- long slice_resid;
unsigned int slice_dispatch;
/* pending metadata requests */
int meta_pending;
- /* number of requests that are on the dispatch list or inside driver */
- int dispatched;
/* io prio of this group */
- unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
+ unsigned short org_ioprio;
+ unsigned short org_ioprio_class;
pid_t pid;
};
enum cfqq_state_flags {
- CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
- CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
- CFQ_CFQQ_FLAG_must_dispatch, /* must be allowed a dispatch */
CFQ_CFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
CFQ_CFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
- CFQ_CFQQ_FLAG_idle_window, /* slice idling enabled */
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
- CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
- CFQ_CFQQ_FLAG_sync, /* synchronous queue */
CFQ_CFQQ_FLAG_coop, /* has done a coop jump of the queue */
};
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq) \
return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0; \
}
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
CFQ_CFQQ_FNS(must_alloc);
CFQ_CFQQ_FNS(must_alloc_slice);
CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
return 0;
}
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
{
- if (cfqd->busy_queues) {
- cfq_log(cfqd, "schedule dispatch");
- kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
- }
+ return ioq_to_io_group(cfqq->ioq);
}
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- return !cfqd->busy_queues;
+ return elv_ioq_class_idle(cfqq->ioq);
}
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
- unsigned short prio)
-{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
{
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+ return elv_ioq_class_rt(cfqq->ioq);
}
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
- cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+ return elv_ioq_sync(cfqq->ioq);
}
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
+ struct cfq_data *cfqd = cfqq->cfqd;
+ struct elevator_queue *e = cfqd->queue->elevator;
- return 1;
+ return (elv_active_sched_queue(e) == cfqq);
}
/*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
* would be nice to take fifo expire time into account as well
*/
static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
BUG_ON(RB_EMPTY_NODE(&last->rb_node));
- if (rbprev)
+ if (rbprev != NULL)
prev = rb_entry_rq(rbprev);
- if (rbnext)
+ if (rbnext != NULL)
next = rb_entry_rq(rbnext);
else {
rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
}
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
-{
- struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
-
- if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
- } else
- rb_key += jiffies;
- } else if (!add_front) {
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key += cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else
- rb_key = 0;
-
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- }
-
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (rb_key < __cfqq->rb_key)
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
-
- if (n == &(*p)->rb_right)
- left = 0;
-
- p = n;
- }
-
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
-
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
static struct cfq_queue *
cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
{
- /*
- * Resorting requires the cfqq to be on the RR list already.
- */
- if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
- cfq_prio_tree_add(cfqd, cfqq);
- }
-}
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- cfq_mark_cfqq_on_rr(cfqq);
- cfqd->busy_queues++;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues++;
+ if (cfqd->active_cic) {
+ put_io_context(cfqd->active_cic->ioc);
+ cfqd->active_cic = NULL;
+ }
- cfq_resort_rr_list(cfqd, cfqq);
+ /* Resort the cfqq in prio tree */
+ if (cfqq)
+ cfq_prio_tree_add(cfqd, cfqq);
}
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
{
- cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
- cfq_clear_cfqq_on_rr(cfqq);
+ struct cfq_queue *cfqq = sched_queue;
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- if (cfqq->p_root) {
- rb_erase(&cfqq->p_node, cfqq->p_root);
- cfqq->p_root = NULL;
- }
+ cfqq->slice_dispatch = 0;
- BUG_ON(!cfqd->busy_queues);
- cfqd->busy_queues--;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues--;
+ cfq_clear_cfqq_must_alloc_slice(cfqq);
+ cfq_clear_cfqq_fifo_expire(cfqq);
+ if (!coop)
+ cfq_clear_cfqq_coop(cfqq);
}
/*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);
BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
elv_rb_del(&cfqq->sort_list, rq);
- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ /*
+ * If this was the last request in the queue, remove this queue from the
+ * prio trees. For the last request, the nr_queued count will still be 1,
+ * as the elevator fair queuing layer has not yet done the accounting.
+ */
+ if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}
static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
cfq_dispatch_insert(cfqd->queue, __alias);
- if (!cfq_cfqq_on_rr(cfqq))
- cfq_add_cfqq_rr(cfqd, cfqq);
-
/*
* check if this request is a better next-serve candidate
*/
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
- cfqd->rq_in_driver++;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
- cfqd->rq_in_driver);
-
cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
}
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- WARN_ON(!cfqd->rq_in_driver);
- cfqd->rq_in_driver--;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
- cfqd->rq_in_driver);
-}
-
static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);
- cfqq->cfqd->rq_queued--;
if (rq_is_meta(rq)) {
WARN_ON(!cfqq->meta_pending);
cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
return 0;
}
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (cfqq) {
- cfq_log_cfqq(cfqd, cfqq, "set_active");
- cfqq->slice_end = 0;
- cfqq->slice_dispatch = 0;
-
- cfq_clear_cfqq_wait_request(cfqq);
- cfq_clear_cfqq_must_dispatch(cfqq);
- cfq_clear_cfqq_must_alloc_slice(cfqq);
- cfq_clear_cfqq_fifo_expire(cfqq);
- cfq_mark_cfqq_slice_new(cfqq);
-
- del_timer(&cfqd->idle_slice_timer);
- }
-
- cfqd->active_queue = cfqq;
-}
-
/*
* current cfqq expired its slice (or was too idle), select new one
*/
static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
- if (cfq_cfqq_wait_request(cfqq))
- del_timer(&cfqd->idle_slice_timer);
-
- cfq_clear_cfqq_wait_request(cfqq);
-
- /*
- * store what was left of this slice, if the queue idled/timed out
- */
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
-
- cfq_resort_rr_list(cfqd, cfqq);
-
- if (cfqq == cfqd->active_queue)
- cfqd->active_queue = NULL;
-
- if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->ioc);
- cfqd->active_cic = NULL;
- }
+ __elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
}
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
-
- return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (!cfqq) {
- cfqq = cfq_get_next_queue(cfqd);
- if (cfqq)
- cfq_clear_cfqq_coop(cfqq);
- }
-
- __cfq_set_active_queue(cfqd, cfqq);
- return cfqq;
+ __cfq_slice_expired(cfqd, cfqq);
}
static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* associated with the I/O issued by cur_cfqq. I'm not sure this is a valid
* assumption.
*/
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
- struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+ void *cur_sched_queue,
int probe)
{
- struct cfq_queue *cfqq;
+ struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
/*
* A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!probe)
cfq_mark_cfqq_coop(cfqq);
- return cfqq;
+ return cfqq->ioq;
}
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_io_context *cic;
unsigned long sl;
- /*
- * SSD device without seek penalty, disable idling. But only do so
- * for devices that support queuing, otherwise we still have a problem
- * with sync vs async workloads.
- */
- if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
-
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
- WARN_ON(cfq_cfqq_slice_new(cfqq));
-
- /*
- * idle is disabled, either manually or by past process history
- */
- if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
- return;
-
- /*
- * still requests with the driver, don't idle
- */
- if (cfqd->rq_in_driver)
- return;
-
+ WARN_ON(elv_ioq_slice_new(cfqq->ioq));
/*
* task has exited, don't wait
*/
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
return;
- cfq_mark_cfqq_wait_request(cfqq);
+ elv_mark_ioq_wait_request(cfqq->ioq);
/*
* we don't want to idle for seeks, but we do want to allow
* fair distribution of slice time for a process doing back-to-back
* seeks. so allow a little bit of time for him to submit a new rq
*/
- sl = cfqd->cfq_slice_idle;
+ sl = elv_get_slice_idle(q->elevator);
if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
- mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+ elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
}
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
*/
static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);
+ struct cfq_data *cfqd = q->elevator->elevator_data;
- cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+ cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
cfq_remove_request(rq);
- cfqq->dispatched++;
elv_dispatch_sort(q, rq);
if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
const int base_rq = cfqd->cfq_slice_async_rq;
+ unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
- WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
- return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
- struct cfq_queue *cfqq, *new_cfqq = NULL;
-
- cfqq = cfqd->active_queue;
- if (!cfqq)
- goto new_queue;
-
- /*
- * The active queue has run out of time, expire it and select new.
- */
- if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
- goto expire;
-
- /*
- * If we have a RT cfqq waiting, then we pre-empt the current non-rt
- * cfqq.
- */
- if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
- /*
- * We simulate this as cfqq timed out so that it gets to bank
- * the remaining of its time slice.
- */
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
- goto new_queue;
- }
-
- /*
- * The active queue has requests and isn't expired, allow it to
- * dispatch.
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto keep_queue;
-
- /*
- * If another queue has a request waiting within our mean seek
- * distance, let it run. The expire code will check for close
- * cooperators and put the close queue at the front of the service
- * tree.
- */
- new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
- if (new_cfqq)
- goto expire;
+ WARN_ON(ioprio >= IOPRIO_BE_NR);
- /*
- * No requests pending. If the active queue still has requests in
- * flight or is idling for a new request, allow either of these
- * conditions to happen (or time out) before selecting a new queue.
- */
- if (timer_pending(&cfqd->idle_slice_timer) ||
- (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
- cfqq = NULL;
- goto keep_queue;
- }
-
-expire:
- cfq_slice_expired(cfqd, 0);
-new_queue:
- cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
- return cfqq;
+ return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
}
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
- cfq_slice_expired(cfqd, 0);
+ /* This is probably redundant now; the above loop should make sure
+ * that all the busy queues have expired */
+ cfq_slice_expired(cfqd);
- BUG_ON(cfqd->busy_queues);
+ BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
struct cfq_queue *cfqq;
unsigned int max_dispatch;
- if (!cfqd->busy_queues)
- return 0;
-
if (unlikely(force))
return cfq_forced_dispatch(cfqd);
- cfqq = cfq_select_queue(cfqd);
+ cfqq = elv_select_sched_queue(q, 0);
if (!cfqq)
return 0;
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* Does this cfqq already have too much IO in flight?
*/
- if (cfqq->dispatched >= max_dispatch) {
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
/*
* idle queue must always only have a single IO in flight
*/
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* We have other queues, don't allow more IO from this one
*/
- if (cfqd->busy_queues > 1)
+ if (elv_nr_busy_ioq(q->elevator) > 1)
return 0;
/*
* we are the only queue, allow up to 4 times of 'quantum'
*/
- if (cfqq->dispatched >= 4 * max_dispatch)
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
return 0;
}
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
*/
cfq_dispatch_request(cfqd, cfqq);
cfqq->slice_dispatch++;
- cfq_clear_cfqq_must_dispatch(cfqq);
/*
* expire an async queue immediately if it has used up its slice. idle
* queue always expire after 1 dispatch round.
*/
- if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+ if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
}
cfq_log(cfqd, "dispatched a request");
return 1;
}
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
{
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_data *cfqd = cfqq->cfqd;
- BUG_ON(atomic_read(&cfqq->ref) <= 0);
+ BUG_ON(!cfqq);
- if (!atomic_dec_and_test(&cfqq->ref))
- return;
-
- cfq_log_cfqq(cfqd, cfqq, "put_queue");
+ cfq_log_cfqq(cfqd, cfqq, "free_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq_is_active_queue(cfqq))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
kmem_cache_free(cfq_pool, cfqq);
}
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+ elv_put_ioq(cfqq->ioq);
+}
+
/*
* Must always be called with the rcu_read_lock() held
*/
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
- cfqd->queue->node);
+ q->node);
if (cic) {
cic->last_end_request = jiffies;
INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
- int ioprio_class;
+ int ioprio_class, ioprio;
if (!cfq_cfqq_prio_changed(cfqq))
return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
/*
* no prio set, inherit CPU scheduling settings
*/
- cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ ioprio = task_nice_ioprio(tsk);
+ ioprio_class = task_nice_ioclass(tsk);
break;
case IOPRIO_CLASS_RT:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_RT;
break;
case IOPRIO_CLASS_BE:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
- cfqq->ioprio = 7;
- cfq_clear_cfqq_idle_window(cfqq);
+ ioprio_class = IOPRIO_CLASS_IDLE;
+ ioprio = 7;
+ elv_clear_ioq_idle_window(cfqq->ioq);
break;
}
+ elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+ elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
*/
- cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio = ioprio;
+ cfqq->org_ioprio_class = ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
struct cfq_data *cfqd = cic->key;
struct cfq_queue *cfqq;
unsigned long flags;
+ struct request_queue *q = cfqd->queue;
if (unlikely(!cfqd))
return;
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
if (cfqq)
cfq_mark_cfqq_prio_changed(cfqq);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+ struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
-
+ struct request_queue *q = cfqd->queue;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
retry:
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
if (!cfqq) {
if (new_cfqq) {
- cfqq = new_cfqq;
- new_cfqq = NULL;
+ goto alloc_ioq;
} else if (gfp_mask & __GFP_WAIT) {
/*
* Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
if (!cfqq)
goto out;
}
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ cfqq = new_cfqq;
+ new_cfqq = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q,
+ gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ kmem_cache_free(cfq_pool, cfqq);
+ cfqq = NULL;
+ goto out;
+ }
+ }
- RB_CLEAR_NODE(&cfqq->rb_node);
+ /*
+ * Both cfqq and ioq objects allocated. Do the initializations
+ * now.
+ */
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
-
- atomic_set(&cfqq->ref, 0);
cfqq->cfqd = cfqd;
cfq_mark_cfqq_prio_changed(cfqq);
+ cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
+ elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+ cfqq->org_ioprio, is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
- cfq_mark_cfqq_idle_window(cfqq);
- cfq_mark_cfqq_sync(cfqq);
+ elv_mark_ioq_idle_window(cfqq->ioq);
+ elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
if (new_cfqq)
kmem_cache_free(cfq_pool, new_cfqq);
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
out:
WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
return cfqq;
}
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
- switch (ioprio_class) {
- case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
- case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
- case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
- default:
- BUG();
- }
-}
-
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+ gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
- struct cfq_queue **async_cfqq = NULL;
+ struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
- cfqq = *async_cfqq;
+ async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+ ioprio);
+ cfqq = async_cfqq;
}
if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
return NULL;
}
- /*
- * pin the queue now that it's allocated, scheduler exit will prune it
- */
- if (!is_sync && !(*async_cfqq)) {
- atomic_inc(&cfqq->ref);
- *async_cfqq = cfqq;
- }
+ if (!is_sync && !async_cfqq)
+ io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- atomic_inc(&cfqq->ref);
+ /* ioc reference */
+ elv_get_ioq(cfqq->ioq);
return cfqq;
}
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
{
unsigned long flags;
int ret;
+ struct request_queue *q = cfqd->queue;
ret = radix_tree_preload(gfp_mask);
if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
radix_tree_preload_end();
if (!ret) {
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
list_add(&cic->queue_list, &cfqd->cic_list);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
}
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct io_context *ioc = NULL;
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
might_sleep_if(gfp_mask & __GFP_WAIT);
- ioc = get_io_context(gfp_mask, cfqd->queue->node);
+ ioc = get_io_context(gfp_mask, q->node);
if (!ioc)
return NULL;
@@ -1889,7 +1494,6 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
-
return cic;
err_free:
cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
}
static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
- unsigned long elapsed = jiffies - cic->last_end_request;
- unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
- cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
- cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
- cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
struct request *rq)
{
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
}
/*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
*/
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+ struct request *rq)
{
- int old_idle, enable_idle;
+ struct cfq_io_context *cic = RQ_CIC(rq);
/*
- * Don't idle for async or idle io prio class
+ * Enabling/disabling idling based on thinktime has been moved
+ * to the common layer.
*/
- if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
- return;
-
- enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
- if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (cfqd->hw_tag && CIC_SEEKY(cic)))
- enable_idle = 0;
- else if (sample_valid(cic->ttime_samples)) {
- if (cic->ttime_mean > cfqd->cfq_slice_idle)
- enable_idle = 0;
- else
- enable_idle = 1;
- }
+ if (!atomic_read(&cic->ioc->nr_tasks) ||
+ (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+ return 0;
- if (old_idle != enable_idle) {
- cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
- if (enable_idle)
- cfq_mark_cfqq_idle_window(cfqq);
- else
- cfq_clear_cfqq_idle_window(cfqq);
- }
+ return 1;
}
/*
* Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only the
+ * cfq-specific parts are left here.
*/
static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
- struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
{
- struct cfq_queue *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
- cfqq = cfqd->active_queue;
if (!cfqq)
return 0;
- if (cfq_slice_used(cfqq))
- return 1;
-
- if (cfq_class_idle(new_cfqq))
- return 0;
-
- if (cfq_class_idle(cfqq))
- return 1;
-
/*
* if the new request is sync, but the currently running queue is
* not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_meta(rq) && !cfqq->meta_pending)
return 1;
- /*
- * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
- */
- if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
- return 1;
-
- if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+ if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
return 0;
/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
}
/*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
-
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
- cfq_service_tree_add(cfqd, cfqq, 1);
-
- cfqq->slice_end = 0;
- cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
* Called when a new fs request (rq) is added (to cfqq). Check if there's
* something we should do about it
+ * After enqueuing the request, whether the queue should be preempted or
+ * kicked is decided by the common layer.
*/
static void
cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
{
struct cfq_io_context *cic = RQ_CIC(rq);
- cfqd->rq_queued++;
if (rq_is_meta(rq))
cfqq->meta_pending++;
- cfq_update_io_thinktime(cfqd, cic);
cfq_update_io_seektime(cfqd, cic, rq);
- cfq_update_idle_window(cfqd, cfqq, cic);
cic->last_request_pos = rq->sector + rq->nr_sectors;
-
- if (cfqq == cfqd->active_queue) {
- /*
- * Remember that we saw a request from this process, but
- * don't start queuing just yet. Otherwise we risk seeing lots
- * of tiny requests, because we disrupt the normal plugging
- * and merging. If the request is already larger than a single
- * page, let it rip immediately. For that case we assume that
- * merging is already done. Ditto for a busy system that
- * has other work pending, don't risk delaying until the
- * idle timer unplug to continue working.
- */
- if (cfq_cfqq_wait_request(cfqq)) {
- if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
- cfqd->busy_queues > 1) {
- del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
- }
- } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
- /*
- * not the active queue - expire current slice if it is
- * idle and has expired it's mean thinktime or this new queue
- * has some old slice time left and is of higher priority or
- * this new queue is RT and the current one is BE
- */
- cfq_preempt_queue(cfqd, cfqq);
- blk_start_queueing(cfqd->queue);
- }
}
static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
cfq_rq_enqueued(cfqd, cfqq, rq);
}
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
- if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
- cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
- if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
- cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
- return;
-
- if (cfqd->hw_tag_samples++ < 50)
- return;
-
- if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
- cfqd->hw_tag = 1;
- else
- cfqd->hw_tag = 0;
-
- cfqd->hw_tag_samples = 0;
- cfqd->rq_in_driver_peak = 0;
-}
-
static void cfq_completed_request(struct request_queue *q, struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
now = jiffies;
cfq_log_cfqq(cfqd, cfqq, "complete");
- cfq_update_hw_tag(cfqd);
-
- WARN_ON(!cfqd->rq_in_driver);
- WARN_ON(!cfqq->dispatched);
- cfqd->rq_in_driver--;
- cfqq->dispatched--;
-
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight--;
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
if (sync)
RQ_CIC(rq)->last_end_request = now;
-
- /*
- * If this is the active queue, check if it needs to be expired,
- * or if we want to idle in case it has no pending requests.
- */
- if (cfqd->active_queue == cfqq) {
- const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
- if (cfq_cfqq_slice_new(cfqq)) {
- cfq_set_prio_slice(cfqd, cfqq);
- cfq_clear_cfqq_slice_new(cfqq);
- }
- /*
- * If there are no requests waiting in this queue, and
- * there are other queues ready to issue requests, AND
- * those other queues are issuing requests within our
- * mean seek distance, give them a chance to run instead
- * of idling.
- */
- if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
- else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
- sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
- }
-
- if (!cfqd->rq_in_driver)
- cfq_schedule_dispatch(cfqd);
}
/*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct io_queue *ioq = cfqq->ioq;
+
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
* users of the filesystem
*/
if (cfq_class_idle(cfqq))
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- if (cfqq->ioprio > IOPRIO_NORM)
- cfqq->ioprio = IOPRIO_NORM;
+ elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+ if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+ elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
} else {
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- if (cfqq->ioprio != cfqq->org_ioprio)
- cfqq->ioprio = cfqq->org_ioprio;
+ if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+ elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+ if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+ elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
}
}
static inline int __cfq_may_queue(struct cfq_queue *cfqq)
{
- if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
- !cfq_cfqq_must_alloc_slice(cfqq)) {
+ if ((elv_ioq_wait_request(cfqq->ioq) ||
+ cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
cfq_mark_cfqq_must_alloc_slice(cfqq);
return ELV_MQUEUE_MUST;
}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq->allocated[rw]++;
cfq_clear_cfqq_must_alloc(cfqq);
- atomic_inc(&cfqq->ref);
+ elv_get_ioq(cfqq->ioq);
spin_unlock_irqrestore(q->queue_lock, flags);
rq->elevator_private = cic;
- rq->elevator_private2 = cfqq;
+ rq->ioq = cfqq->ioq;
return 0;
queue_fail:
if (cic)
put_io_context(cic->ioc);
- cfq_schedule_dispatch(cfqd);
+ elv_schedule_dispatch(cfqd->queue);
spin_unlock_irqrestore(q->queue_lock, flags);
cfq_log(cfqd, "set_request fail");
return 1;
}
-static void cfq_kick_queue(struct work_struct *work)
-{
- struct cfq_data *cfqd =
- container_of(work, struct cfq_data, unplug_work);
- struct request_queue *q = cfqd->queue;
-
- spin_lock_irq(q->queue_lock);
- blk_start_queueing(q);
- spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
- struct cfq_data *cfqd = (struct cfq_data *) data;
- struct cfq_queue *cfqq;
- unsigned long flags;
- int timed_out = 1;
-
- cfq_log(cfqd, "idle timer fired");
-
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
- cfqq = cfqd->active_queue;
- if (cfqq) {
- timed_out = 0;
-
- /*
- * We saw a request before the queue expired, let it through
- */
- if (cfq_cfqq_must_dispatch(cfqq))
- goto out_kick;
-
- /*
- * expired
- */
- if (cfq_slice_used(cfqq))
- goto expire;
-
- /*
- * only expire and reinvoke request handler, if there are
- * other queues with pending requests
- */
- if (!cfqd->busy_queues)
- goto out_cont;
-
- /*
- * not expired and it has a request pending, let it dispatch
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto out_kick;
- }
-expire:
- cfq_slice_expired(cfqd, timed_out);
-out_kick:
- cfq_schedule_dispatch(cfqd);
-out_cont:
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
- del_timer_sync(&cfqd->idle_slice_timer);
- cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
- int i;
-
- for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
- }
-
- if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
struct request_queue *q = cfqd->queue;
- cfq_shutdown_timer_wq(cfqd);
-
spin_lock_irq(q->queue_lock);
- if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
while (!list_empty(&cfqd->cic_list)) {
struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}
- cfq_put_async_queues(cfqd);
-
spin_unlock_irq(q->queue_lock);
-
- cfq_shutdown_timer_wq(cfqd);
-
kfree(cfqd);
}
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;
- cfqd->service_tree = CFQ_RB_ROOT;
-
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->queue = q;
- init_timer(&cfqd->idle_slice_timer);
- cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
- cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
- INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
cfqd->last_end_request = jiffies;
cfqd->cfq_quantum = cfq_quantum;
cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
- cfqd->cfq_slice[0] = cfq_slice_async;
- cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
- cfqd->cfq_slice_idle = cfq_slice_idle;
- cfqd->hw_tag = 1;
return cfqd;
}
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
#undef SHOW_FUNCTION
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
#undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(fifo_expire_async),
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
- CFQ_ATTR(slice_sync),
- CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
- CFQ_ATTR(slice_idle),
__ATTR_NULL
};
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
.elevator_dispatch_fn = cfq_dispatch_requests,
.elevator_add_req_fn = cfq_insert_request,
.elevator_activate_req_fn = cfq_activate_request,
- .elevator_deactivate_req_fn = cfq_deactivate_request,
- .elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
+ .elevator_free_sched_queue_fn = cfq_free_cfq_queue,
+ .elevator_active_ioq_set_fn = cfq_active_ioq_set,
+ .elevator_active_ioq_reset_fn = cfq_active_ioq_reset,
+ .elevator_arm_slice_timer_fn = cfq_arm_slice_timer,
+ .elevator_should_preempt_fn = cfq_should_preempt,
+ .elevator_update_idle_window_fn = cfq_update_idle_window,
+ .elevator_close_cooperator_fn = cfq_close_cooperator,
},
+ .elevator_features = ELV_IOSCHED_NEED_FQ,
.elevator_attrs = cfq_attrs,
.elevator_name = "cfq",
.elevator_owner = THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
static int __init cfq_init(void)
{
- /*
- * could be 0 on HZ < 1000 setups
- */
- if (!cfq_slice_async)
- cfq_slice_async = 1;
- if (!cfq_slice_idle)
- cfq_slice_idle = 1;
-
if (cfq_slab_setup())
return -ENOMEM;
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
[parent not found: <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
[not found] ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-22 8:54 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22 8:54 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
Since the thinking time logic is moving to the common layer, the
corresponding items in cic are not needed.
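For reference, the arithmetic that moves out of cfq is the decaying thinktime
average that cfq_update_io_thinktime() maintained in these cfq_io_context
fields. A minimal user-space sketch of that averaging (illustrative names
only, not the kernel API):

#include <stdio.h>

/* Mirrors the 7/8 decaying average cfq kept per cfq_io_context. */
struct ttime_stats {
        unsigned long ttime_total;      /* decayed sum of thinktimes, scaled by 256 */
        unsigned long ttime_samples;    /* decayed sample count, scaled by 256 */
        unsigned long ttime_mean;       /* ttime_total / ttime_samples */
};

static void update_thinktime(struct ttime_stats *ts, unsigned long ttime)
{
        /* old state keeps 7/8 of its weight, the new sample contributes 1/8 */
        ts->ttime_samples = (7 * ts->ttime_samples + 256) / 8;
        ts->ttime_total = (7 * ts->ttime_total + 256 * ttime) / 8;
        ts->ttime_mean = (ts->ttime_total + 128) / ts->ttime_samples;
}

int main(void)
{
        struct ttime_stats ts = { 0, 0, 0 };
        unsigned long samples[] = { 2, 4, 8, 4, 2 };    /* jiffies between requests */
        unsigned long i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                update_thinktime(&ts, samples[i]);
                printf("sample=%lu mean=%lu\n", samples[i], ts.ttime_mean);
        }
        return 0;
}

Once equivalent state is kept by the common elevator fair queuing code, the
fields removed below are dead weight in cfq_io_context.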
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ed52a1f..1fe9d78 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -42,10 +42,6 @@ struct cfq_io_context {
unsigned long last_end_request;
sector_t last_request_pos;
- unsigned long ttime_total;
- unsigned long ttime_samples;
- unsigned long ttime_mean;
-
unsigned int seek_samples;
u64 seek_total;
sector_t seek_mean;
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
[not found] ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-22 8:54 ` Gui Jianfeng
[not found] ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-22 12:33 ` Vivek Goyal
1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22 8:54 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Hi Vivek,
Since the thinking time logic is moving to the common layer, the
corresponding items in cic are not needed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ed52a1f..1fe9d78 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -42,10 +42,6 @@ struct cfq_io_context {
unsigned long last_end_request;
sector_t last_request_pos;
- unsigned long ttime_total;
- unsigned long ttime_samples;
- unsigned long ttime_mean;
-
unsigned int seek_samples;
u64 seek_total;
sector_t seek_mean;
^ permalink raw reply related [flat|nested] 297+ messages in thread
[parent not found: <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
[not found] ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-22 12:33 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:33 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Fri, May 22, 2009 at 04:54:01PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> Since the thinking time logic is moving to the common layer, the
> corresponding items in cic are not needed.
>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index ed52a1f..1fe9d78 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -42,10 +42,6 @@ struct cfq_io_context {
> unsigned long last_end_request;
> sector_t last_request_pos;
>
> - unsigned long ttime_total;
> - unsigned long ttime_samples;
> - unsigned long ttime_mean;
> -
> unsigned int seek_samples;
> u64 seek_total;
> sector_t seek_mean;
>
Thanks Gui. Queued for next posting.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
2009-05-22 8:54 ` Gui Jianfeng
[not found] ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-22 12:33 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:33 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Fri, May 22, 2009 at 04:54:01PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> Since the thinking time logic is moving to the common layer, the
> corresponding items in cic are not needed.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index ed52a1f..1fe9d78 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -42,10 +42,6 @@ struct cfq_io_context {
> unsigned long last_end_request;
> sector_t last_request_pos;
>
> - unsigned long ttime_total;
> - unsigned long ttime_samples;
> - unsigned long ttime_mean;
> -
> unsigned int seek_samples;
> u64 seek_total;
> sector_t seek_mean;
>
Thanks Gui. Queued for next posting.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (5 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
` (30 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This patch changes cfq to use the fair queuing code from the elevator layer.
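The structural change is an inversion of control: the service tree, the idle
slice timer and queue selection move out of cfq into the elevator fair
queuing layer, and cfq is reduced to answering policy questions through the
new elevator ops added at the end of this patch (.elevator_should_preempt_fn,
.elevator_update_idle_window_fn, .elevator_arm_slice_timer_fn and friends).
A toy user-space sketch of that split, with purely illustrative names rather
than the kernel API:

#include <stdio.h>

/* Policy callbacks a scheduler hands to the common fair queuing layer. */
struct sched_ops {
        int (*should_preempt)(int active_is_sync, int new_is_sync);
        int (*keep_idling)(int seeky);
};

/*
 * "cfq-like" policy: a sync request may preempt an async queue, and we do
 * not idle for seeky processes. This mirrors the spirit of
 * cfq_should_preempt() and cfq_update_idle_window() after this patch, not
 * their exact logic.
 */
static int toy_should_preempt(int active_is_sync, int new_is_sync)
{
        return new_is_sync && !active_is_sync;
}

static int toy_keep_idling(int seeky)
{
        return !seeky;
}

/* The common layer owns the decision; the scheduler only advises it. */
static void common_layer_enqueue(const struct sched_ops *ops,
                                 int active_is_sync, int new_is_sync,
                                 int seeky)
{
        if (ops->should_preempt(active_is_sync, new_is_sync))
                printf("common layer: expire active queue, run the new one\n");
        else if (ops->keep_idling(seeky))
                printf("common layer: keep idling on the active queue\n");
        else
                printf("common layer: dispatch from the active queue\n");
}

int main(void)
{
        const struct sched_ops cfq_like = {
                .should_preempt = toy_should_preempt,
                .keep_idling    = toy_keep_idling,
        };

        common_layer_enqueue(&cfq_like, 0, 1, 0);       /* sync rq vs async queue */
        common_layer_enqueue(&cfq_like, 1, 1, 1);       /* seeky sync workload */
        return 0;
}

The real wiring is in the elevator ops diff below; the sketch only shows why
per-scheduler code shrinks once queue bookkeeping lives in one place.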
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 3 +-
block/cfq-iosched.c | 1097 ++++++++++---------------------------------------
2 files changed, 219 insertions(+), 881 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
menu "IO Schedulers"
config ELV_FAIR_QUEUING
- bool "Elevator Fair Queuing Support"
+ bool
default n
---help---
Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select ELV_FAIR_QUEUING
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
-
/*
* tunables
*/
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
static const int cfq_back_max = 16 * 1024;
/* penalty of a backwards seek */
static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY (HZ / 5)
/*
* below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq) (struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
static struct kmem_cache *cfq_pool;
static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);
#define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
#define sample_valid(samples) ((samples) > 80)
@@ -75,12 +64,6 @@ struct cfq_rb_root {
*/
struct cfq_data {
struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_rb_root service_tree;
-
/*
* Each priority tree is sorted by next_request position. These
* trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
*/
struct rb_root prio_trees[CFQ_PRIO_LISTS];
- unsigned int busy_queues;
- /*
- * Used to track any pending rt requests so we can pre-empt current
- * non-RT cfqq in service when this value is non-zero.
- */
- unsigned int busy_rt_queues;
-
- int rq_in_driver;
int sync_flight;
- /*
- * queue-depth detection
- */
- int rq_queued;
- int hw_tag;
- int hw_tag_samples;
- int rq_in_driver_peak;
-
- /*
- * idle window management
- */
- struct timer_list idle_slice_timer;
- struct work_struct unplug_work;
-
- struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;
- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;
unsigned long last_end_request;
@@ -131,9 +85,7 @@ struct cfq_data {
unsigned int cfq_fifo_expire[2];
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
- unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
- unsigned int cfq_slice_idle;
struct list_head cic_list;
};
@@ -142,16 +94,11 @@ struct cfq_data {
* Per process-grouping structure
*/
struct cfq_queue {
- /* reference count */
- atomic_t ref;
+ struct io_queue *ioq;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;
- unsigned long slice_end;
- long slice_resid;
unsigned int slice_dispatch;
/* pending metadata requests */
int meta_pending;
- /* number of requests that are on the dispatch list or inside driver */
- int dispatched;
/* io prio of this group */
- unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
+ unsigned short org_ioprio;
+ unsigned short org_ioprio_class;
pid_t pid;
};
enum cfqq_state_flags {
- CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
- CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
- CFQ_CFQQ_FLAG_must_dispatch, /* must be allowed a dispatch */
CFQ_CFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
CFQ_CFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
- CFQ_CFQQ_FLAG_idle_window, /* slice idling enabled */
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
- CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
- CFQ_CFQQ_FLAG_sync, /* synchronous queue */
CFQ_CFQQ_FLAG_coop, /* has done a coop jump of the queue */
};
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq) \
return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0; \
}
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
CFQ_CFQQ_FNS(must_alloc);
CFQ_CFQQ_FNS(must_alloc_slice);
CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
return 0;
}
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
{
- if (cfqd->busy_queues) {
- cfq_log(cfqd, "schedule dispatch");
- kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
- }
+ return ioq_to_io_group(cfqq->ioq);
}
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- return !cfqd->busy_queues;
+ return elv_ioq_class_idle(cfqq->ioq);
}
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
- unsigned short prio)
-{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
{
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+ return elv_ioq_class_rt(cfqq->ioq);
}
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
- cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+ return elv_ioq_sync(cfqq->ioq);
}
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
+ struct cfq_data *cfqd = cfqq->cfqd;
+ struct elevator_queue *e = cfqd->queue->elevator;
- return 1;
+ return (elv_active_sched_queue(e) == cfqq);
}
/*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
* would be nice to take fifo expire time into account as well
*/
static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
BUG_ON(RB_EMPTY_NODE(&last->rb_node));
- if (rbprev)
+ if (rbprev != NULL)
prev = rb_entry_rq(rbprev);
- if (rbnext)
+ if (rbnext != NULL)
next = rb_entry_rq(rbnext);
else {
rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
}
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
-{
- struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
-
- if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
- } else
- rb_key += jiffies;
- } else if (!add_front) {
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key += cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else
- rb_key = 0;
-
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- }
-
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (rb_key < __cfqq->rb_key)
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
-
- if (n == &(*p)->rb_right)
- left = 0;
-
- p = n;
- }
-
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
-
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
static struct cfq_queue *
cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
{
- /*
- * Resorting requires the cfqq to be on the RR list already.
- */
- if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
- cfq_prio_tree_add(cfqd, cfqq);
- }
-}
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- cfq_mark_cfqq_on_rr(cfqq);
- cfqd->busy_queues++;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues++;
+ if (cfqd->active_cic) {
+ put_io_context(cfqd->active_cic->ioc);
+ cfqd->active_cic = NULL;
+ }
- cfq_resort_rr_list(cfqd, cfqq);
+ /* Resort the cfqq in prio tree */
+ if (cfqq)
+ cfq_prio_tree_add(cfqd, cfqq);
}
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
{
- cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
- cfq_clear_cfqq_on_rr(cfqq);
+ struct cfq_queue *cfqq = sched_queue;
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- if (cfqq->p_root) {
- rb_erase(&cfqq->p_node, cfqq->p_root);
- cfqq->p_root = NULL;
- }
+ cfqq->slice_dispatch = 0;
- BUG_ON(!cfqd->busy_queues);
- cfqd->busy_queues--;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues--;
+ cfq_clear_cfqq_must_alloc_slice(cfqq);
+ cfq_clear_cfqq_fifo_expire(cfqq);
+ if (!coop)
+ cfq_clear_cfqq_coop(cfqq);
}
/*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);
BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
elv_rb_del(&cfqq->sort_list, rq);
- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ /*
+ * If this was the last request in the queue, remove this queue from the
+ * prio trees. For the last request the nr_queued count will still be 1,
+ * as the elevator fair queuing layer is yet to do the accounting.
+ */
+ if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}
static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
cfq_dispatch_insert(cfqd->queue, __alias);
- if (!cfq_cfqq_on_rr(cfqq))
- cfq_add_cfqq_rr(cfqd, cfqq);
-
/*
* check if this request is a better next-serve candidate
*/
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
- cfqd->rq_in_driver++;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
- cfqd->rq_in_driver);
-
cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
}
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- WARN_ON(!cfqd->rq_in_driver);
- cfqd->rq_in_driver--;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
- cfqd->rq_in_driver);
-}
-
static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);
- cfqq->cfqd->rq_queued--;
if (rq_is_meta(rq)) {
WARN_ON(!cfqq->meta_pending);
cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
return 0;
}
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (cfqq) {
- cfq_log_cfqq(cfqd, cfqq, "set_active");
- cfqq->slice_end = 0;
- cfqq->slice_dispatch = 0;
-
- cfq_clear_cfqq_wait_request(cfqq);
- cfq_clear_cfqq_must_dispatch(cfqq);
- cfq_clear_cfqq_must_alloc_slice(cfqq);
- cfq_clear_cfqq_fifo_expire(cfqq);
- cfq_mark_cfqq_slice_new(cfqq);
-
- del_timer(&cfqd->idle_slice_timer);
- }
-
- cfqd->active_queue = cfqq;
-}
-
/*
* current cfqq expired its slice (or was too idle), select new one
*/
static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
- if (cfq_cfqq_wait_request(cfqq))
- del_timer(&cfqd->idle_slice_timer);
-
- cfq_clear_cfqq_wait_request(cfqq);
-
- /*
- * store what was left of this slice, if the queue idled/timed out
- */
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
-
- cfq_resort_rr_list(cfqd, cfqq);
-
- if (cfqq == cfqd->active_queue)
- cfqd->active_queue = NULL;
-
- if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->ioc);
- cfqd->active_cic = NULL;
- }
+ __elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
}
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
-
- return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (!cfqq) {
- cfqq = cfq_get_next_queue(cfqd);
- if (cfqq)
- cfq_clear_cfqq_coop(cfqq);
- }
-
- __cfq_set_active_queue(cfqd, cfqq);
- return cfqq;
+ __cfq_slice_expired(cfqd, cfqq);
}
static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* associated with the I/O issued by cur_cfqq. I'm not sure this is a valid
* assumption.
*/
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
- struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+ void *cur_sched_queue,
int probe)
{
- struct cfq_queue *cfqq;
+ struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
/*
* A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!probe)
cfq_mark_cfqq_coop(cfqq);
- return cfqq;
+ return cfqq->ioq;
}
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_io_context *cic;
unsigned long sl;
- /*
- * SSD device without seek penalty, disable idling. But only do so
- * for devices that support queuing, otherwise we still have a problem
- * with sync vs async workloads.
- */
- if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
-
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
- WARN_ON(cfq_cfqq_slice_new(cfqq));
-
- /*
- * idle is disabled, either manually or by past process history
- */
- if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
- return;
-
- /*
- * still requests with the driver, don't idle
- */
- if (cfqd->rq_in_driver)
- return;
-
+ WARN_ON(elv_ioq_slice_new(cfqq->ioq));
/*
* task has exited, don't wait
*/
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
return;
- cfq_mark_cfqq_wait_request(cfqq);
+ elv_mark_ioq_wait_request(cfqq->ioq);
/*
* we don't want to idle for seeks, but we do want to allow
* fair distribution of slice time for a process doing back-to-back
* seeks. so allow a little bit of time for him to submit a new rq
*/
- sl = cfqd->cfq_slice_idle;
+ sl = elv_get_slice_idle(q->elevator);
if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
- mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+ elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
}
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
*/
static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);
+ struct cfq_data *cfqd = q->elevator->elevator_data;
- cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+ cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
cfq_remove_request(rq);
- cfqq->dispatched++;
elv_dispatch_sort(q, rq);
if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
const int base_rq = cfqd->cfq_slice_async_rq;
+ unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
- WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
- return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
- struct cfq_queue *cfqq, *new_cfqq = NULL;
-
- cfqq = cfqd->active_queue;
- if (!cfqq)
- goto new_queue;
-
- /*
- * The active queue has run out of time, expire it and select new.
- */
- if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
- goto expire;
-
- /*
- * If we have a RT cfqq waiting, then we pre-empt the current non-rt
- * cfqq.
- */
- if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
- /*
- * We simulate this as cfqq timed out so that it gets to bank
- * the remaining of its time slice.
- */
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
- goto new_queue;
- }
-
- /*
- * The active queue has requests and isn't expired, allow it to
- * dispatch.
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto keep_queue;
-
- /*
- * If another queue has a request waiting within our mean seek
- * distance, let it run. The expire code will check for close
- * cooperators and put the close queue at the front of the service
- * tree.
- */
- new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
- if (new_cfqq)
- goto expire;
+ WARN_ON(ioprio >= IOPRIO_BE_NR);
- /*
- * No requests pending. If the active queue still has requests in
- * flight or is idling for a new request, allow either of these
- * conditions to happen (or time out) before selecting a new queue.
- */
- if (timer_pending(&cfqd->idle_slice_timer) ||
- (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
- cfqq = NULL;
- goto keep_queue;
- }
-
-expire:
- cfq_slice_expired(cfqd, 0);
-new_queue:
- cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
- return cfqq;
+ return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
}
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
- cfq_slice_expired(cfqd, 0);
+ /* This is probably redundant now. The above loop should make sure
+ * that all the busy queues have expired. */
+ cfq_slice_expired(cfqd);
- BUG_ON(cfqd->busy_queues);
+ BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
struct cfq_queue *cfqq;
unsigned int max_dispatch;
- if (!cfqd->busy_queues)
- return 0;
-
if (unlikely(force))
return cfq_forced_dispatch(cfqd);
- cfqq = cfq_select_queue(cfqd);
+ cfqq = elv_select_sched_queue(q, 0);
if (!cfqq)
return 0;
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* Does this cfqq already have too much IO in flight?
*/
- if (cfqq->dispatched >= max_dispatch) {
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
/*
* idle queue must always only have a single IO in flight
*/
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* We have other queues, don't allow more IO from this one
*/
- if (cfqd->busy_queues > 1)
+ if (elv_nr_busy_ioq(q->elevator) > 1)
return 0;
/*
* we are the only queue, allow up to 4 times of 'quantum'
*/
- if (cfqq->dispatched >= 4 * max_dispatch)
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
return 0;
}
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
*/
cfq_dispatch_request(cfqd, cfqq);
cfqq->slice_dispatch++;
- cfq_clear_cfqq_must_dispatch(cfqq);
/*
* expire an async queue immediately if it has used up its slice. idle
* queue always expire after 1 dispatch round.
*/
- if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+ if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
}
cfq_log(cfqd, "dispatched a request");
return 1;
}
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
{
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_data *cfqd = cfqq->cfqd;
- BUG_ON(atomic_read(&cfqq->ref) <= 0);
+ BUG_ON(!cfqq);
- if (!atomic_dec_and_test(&cfqq->ref))
- return;
-
- cfq_log_cfqq(cfqd, cfqq, "put_queue");
+ cfq_log_cfqq(cfqd, cfqq, "free_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq_is_active_queue(cfqq))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
kmem_cache_free(cfq_pool, cfqq);
}
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+ elv_put_ioq(cfqq->ioq);
+}
+
/*
* Must always be called with the rcu_read_lock() held
*/
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
- cfqd->queue->node);
+ q->node);
if (cic) {
cic->last_end_request = jiffies;
INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
- int ioprio_class;
+ int ioprio_class, ioprio;
if (!cfq_cfqq_prio_changed(cfqq))
return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
/*
* no prio set, inherit CPU scheduling settings
*/
- cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ ioprio = task_nice_ioprio(tsk);
+ ioprio_class = task_nice_ioclass(tsk);
break;
case IOPRIO_CLASS_RT:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_RT;
break;
case IOPRIO_CLASS_BE:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
- cfqq->ioprio = 7;
- cfq_clear_cfqq_idle_window(cfqq);
+ ioprio_class = IOPRIO_CLASS_IDLE;
+ ioprio = 7;
+ elv_clear_ioq_idle_window(cfqq->ioq);
break;
}
+ elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+ elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
*/
- cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio = ioprio;
+ cfqq->org_ioprio_class = ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
struct cfq_data *cfqd = cic->key;
struct cfq_queue *cfqq;
unsigned long flags;
+ struct request_queue *q = cfqd->queue;
if (unlikely(!cfqd))
return;
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
if (cfqq)
cfq_mark_cfqq_prio_changed(cfqq);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+ struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
-
+ struct request_queue *q = cfqd->queue;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
retry:
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
if (!cfqq) {
if (new_cfqq) {
- cfqq = new_cfqq;
- new_cfqq = NULL;
+ goto alloc_ioq;
} else if (gfp_mask & __GFP_WAIT) {
/*
* Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
if (!cfqq)
goto out;
}
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ cfqq = new_cfqq;
+ new_cfqq = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q,
+ gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ kmem_cache_free(cfq_pool, cfqq);
+ cfqq = NULL;
+ goto out;
+ }
+ }
- RB_CLEAR_NODE(&cfqq->rb_node);
+ /*
+ * Both cfqq and ioq objects are allocated. Do the initializations
+ * now.
+ */
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
-
- atomic_set(&cfqq->ref, 0);
cfqq->cfqd = cfqd;
cfq_mark_cfqq_prio_changed(cfqq);
+ cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
+ elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+ cfqq->org_ioprio, is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
- cfq_mark_cfqq_idle_window(cfqq);
- cfq_mark_cfqq_sync(cfqq);
+ elv_mark_ioq_idle_window(cfqq->ioq);
+ elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
if (new_cfqq)
kmem_cache_free(cfq_pool, new_cfqq);
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
out:
WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
return cfqq;
}
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
- switch (ioprio_class) {
- case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
- case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
- case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
- default:
- BUG();
- }
-}
-
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+ gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
- struct cfq_queue **async_cfqq = NULL;
+ struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
- cfqq = *async_cfqq;
+ async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+ ioprio);
+ cfqq = async_cfqq;
}
if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
return NULL;
}
- /*
- * pin the queue now that it's allocated, scheduler exit will prune it
- */
- if (!is_sync && !(*async_cfqq)) {
- atomic_inc(&cfqq->ref);
- *async_cfqq = cfqq;
- }
+ if (!is_sync && !async_cfqq)
+ io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- atomic_inc(&cfqq->ref);
+ /* ioc reference */
+ elv_get_ioq(cfqq->ioq);
return cfqq;
}
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
{
unsigned long flags;
int ret;
+ struct request_queue *q = cfqd->queue;
ret = radix_tree_preload(gfp_mask);
if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
radix_tree_preload_end();
if (!ret) {
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
list_add(&cic->queue_list, &cfqd->cic_list);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
}
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct io_context *ioc = NULL;
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
might_sleep_if(gfp_mask & __GFP_WAIT);
- ioc = get_io_context(gfp_mask, cfqd->queue->node);
+ ioc = get_io_context(gfp_mask, q->node);
if (!ioc)
return NULL;
@@ -1889,7 +1494,6 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
-
return cic;
err_free:
cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
}
static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
- unsigned long elapsed = jiffies - cic->last_end_request;
- unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
- cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
- cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
- cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
struct request *rq)
{
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
}
/*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
*/
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+ struct request *rq)
{
- int old_idle, enable_idle;
+ struct cfq_io_context *cic = RQ_CIC(rq);
/*
- * Don't idle for async or idle io prio class
+ * Enabling/Disabling idling based on thinktime has been moved
+ * to the common layer.
*/
- if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
- return;
-
- enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
- if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (cfqd->hw_tag && CIC_SEEKY(cic)))
- enable_idle = 0;
- else if (sample_valid(cic->ttime_samples)) {
- if (cic->ttime_mean > cfqd->cfq_slice_idle)
- enable_idle = 0;
- else
- enable_idle = 1;
- }
+ if (!atomic_read(&cic->ioc->nr_tasks) ||
+ (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+ return 0;
- if (old_idle != enable_idle) {
- cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
- if (enable_idle)
- cfq_mark_cfqq_idle_window(cfqq);
- else
- cfq_clear_cfqq_idle_window(cfqq);
- }
+ return 1;
}
/*
* Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to common layer. Only cfq
+ * specific parts are left here.
*/
static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
- struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
{
- struct cfq_queue *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
- cfqq = cfqd->active_queue;
if (!cfqq)
return 0;
- if (cfq_slice_used(cfqq))
- return 1;
-
- if (cfq_class_idle(new_cfqq))
- return 0;
-
- if (cfq_class_idle(cfqq))
- return 1;
-
/*
* if the new request is sync, but the currently running queue is
* not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_meta(rq) && !cfqq->meta_pending)
return 1;
- /*
- * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
- */
- if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
- return 1;
-
- if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+ if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
return 0;
/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
}
/*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
-
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
- cfq_service_tree_add(cfqd, cfqq, 1);
-
- cfqq->slice_end = 0;
- cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
* Called when a new fs request (rq) is added (to cfqq). Check if there's
* something we should do about it
+ * After the request is enqueued, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
*/
static void
cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
{
struct cfq_io_context *cic = RQ_CIC(rq);
- cfqd->rq_queued++;
if (rq_is_meta(rq))
cfqq->meta_pending++;
- cfq_update_io_thinktime(cfqd, cic);
cfq_update_io_seektime(cfqd, cic, rq);
- cfq_update_idle_window(cfqd, cfqq, cic);
cic->last_request_pos = rq->sector + rq->nr_sectors;
-
- if (cfqq == cfqd->active_queue) {
- /*
- * Remember that we saw a request from this process, but
- * don't start queuing just yet. Otherwise we risk seeing lots
- * of tiny requests, because we disrupt the normal plugging
- * and merging. If the request is already larger than a single
- * page, let it rip immediately. For that case we assume that
- * merging is already done. Ditto for a busy system that
- * has other work pending, don't risk delaying until the
- * idle timer unplug to continue working.
- */
- if (cfq_cfqq_wait_request(cfqq)) {
- if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
- cfqd->busy_queues > 1) {
- del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
- }
- } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
- /*
- * not the active queue - expire current slice if it is
- * idle and has expired it's mean thinktime or this new queue
- * has some old slice time left and is of higher priority or
- * this new queue is RT and the current one is BE
- */
- cfq_preempt_queue(cfqd, cfqq);
- blk_start_queueing(cfqd->queue);
- }
}
static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
cfq_rq_enqueued(cfqd, cfqq, rq);
}
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
- if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
- cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
- if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
- cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
- return;
-
- if (cfqd->hw_tag_samples++ < 50)
- return;
-
- if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
- cfqd->hw_tag = 1;
- else
- cfqd->hw_tag = 0;
-
- cfqd->hw_tag_samples = 0;
- cfqd->rq_in_driver_peak = 0;
-}
-
static void cfq_completed_request(struct request_queue *q, struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
now = jiffies;
cfq_log_cfqq(cfqd, cfqq, "complete");
- cfq_update_hw_tag(cfqd);
-
- WARN_ON(!cfqd->rq_in_driver);
- WARN_ON(!cfqq->dispatched);
- cfqd->rq_in_driver--;
- cfqq->dispatched--;
-
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight--;
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
if (sync)
RQ_CIC(rq)->last_end_request = now;
-
- /*
- * If this is the active queue, check if it needs to be expired,
- * or if we want to idle in case it has no pending requests.
- */
- if (cfqd->active_queue == cfqq) {
- const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
- if (cfq_cfqq_slice_new(cfqq)) {
- cfq_set_prio_slice(cfqd, cfqq);
- cfq_clear_cfqq_slice_new(cfqq);
- }
- /*
- * If there are no requests waiting in this queue, and
- * there are other queues ready to issue requests, AND
- * those other queues are issuing requests within our
- * mean seek distance, give them a chance to run instead
- * of idling.
- */
- if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
- else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
- sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
- }
-
- if (!cfqd->rq_in_driver)
- cfq_schedule_dispatch(cfqd);
}
/*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct io_queue *ioq = cfqq->ioq;
+
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
* users of the filesystem
*/
if (cfq_class_idle(cfqq))
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- if (cfqq->ioprio > IOPRIO_NORM)
- cfqq->ioprio = IOPRIO_NORM;
+ elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+ if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+ elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
} else {
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- if (cfqq->ioprio != cfqq->org_ioprio)
- cfqq->ioprio = cfqq->org_ioprio;
+ if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+ elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+ if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+ elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
}
}
static inline int __cfq_may_queue(struct cfq_queue *cfqq)
{
- if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
- !cfq_cfqq_must_alloc_slice(cfqq)) {
+ if ((elv_ioq_wait_request(cfqq->ioq) ||
+ cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
cfq_mark_cfqq_must_alloc_slice(cfqq);
return ELV_MQUEUE_MUST;
}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq->allocated[rw]++;
cfq_clear_cfqq_must_alloc(cfqq);
- atomic_inc(&cfqq->ref);
+ elv_get_ioq(cfqq->ioq);
spin_unlock_irqrestore(q->queue_lock, flags);
rq->elevator_private = cic;
- rq->elevator_private2 = cfqq;
+ rq->ioq = cfqq->ioq;
return 0;
queue_fail:
if (cic)
put_io_context(cic->ioc);
- cfq_schedule_dispatch(cfqd);
+ elv_schedule_dispatch(cfqd->queue);
spin_unlock_irqrestore(q->queue_lock, flags);
cfq_log(cfqd, "set_request fail");
return 1;
}
-static void cfq_kick_queue(struct work_struct *work)
-{
- struct cfq_data *cfqd =
- container_of(work, struct cfq_data, unplug_work);
- struct request_queue *q = cfqd->queue;
-
- spin_lock_irq(q->queue_lock);
- blk_start_queueing(q);
- spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
- struct cfq_data *cfqd = (struct cfq_data *) data;
- struct cfq_queue *cfqq;
- unsigned long flags;
- int timed_out = 1;
-
- cfq_log(cfqd, "idle timer fired");
-
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
- cfqq = cfqd->active_queue;
- if (cfqq) {
- timed_out = 0;
-
- /*
- * We saw a request before the queue expired, let it through
- */
- if (cfq_cfqq_must_dispatch(cfqq))
- goto out_kick;
-
- /*
- * expired
- */
- if (cfq_slice_used(cfqq))
- goto expire;
-
- /*
- * only expire and reinvoke request handler, if there are
- * other queues with pending requests
- */
- if (!cfqd->busy_queues)
- goto out_cont;
-
- /*
- * not expired and it has a request pending, let it dispatch
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto out_kick;
- }
-expire:
- cfq_slice_expired(cfqd, timed_out);
-out_kick:
- cfq_schedule_dispatch(cfqd);
-out_cont:
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
- del_timer_sync(&cfqd->idle_slice_timer);
- cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
- int i;
-
- for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
- }
-
- if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
struct request_queue *q = cfqd->queue;
- cfq_shutdown_timer_wq(cfqd);
-
spin_lock_irq(q->queue_lock);
- if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
while (!list_empty(&cfqd->cic_list)) {
struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}
- cfq_put_async_queues(cfqd);
-
spin_unlock_irq(q->queue_lock);
-
- cfq_shutdown_timer_wq(cfqd);
-
kfree(cfqd);
}
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;
- cfqd->service_tree = CFQ_RB_ROOT;
-
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->queue = q;
- init_timer(&cfqd->idle_slice_timer);
- cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
- cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
- INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
cfqd->last_end_request = jiffies;
cfqd->cfq_quantum = cfq_quantum;
cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
- cfqd->cfq_slice[0] = cfq_slice_async;
- cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
- cfqd->cfq_slice_idle = cfq_slice_idle;
- cfqd->hw_tag = 1;
return cfqd;
}
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
#undef SHOW_FUNCTION
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
#undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(fifo_expire_async),
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
- CFQ_ATTR(slice_sync),
- CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
- CFQ_ATTR(slice_idle),
__ATTR_NULL
};
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
.elevator_dispatch_fn = cfq_dispatch_requests,
.elevator_add_req_fn = cfq_insert_request,
.elevator_activate_req_fn = cfq_activate_request,
- .elevator_deactivate_req_fn = cfq_deactivate_request,
- .elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
+ .elevator_free_sched_queue_fn = cfq_free_cfq_queue,
+ .elevator_active_ioq_set_fn = cfq_active_ioq_set,
+ .elevator_active_ioq_reset_fn = cfq_active_ioq_reset,
+ .elevator_arm_slice_timer_fn = cfq_arm_slice_timer,
+ .elevator_should_preempt_fn = cfq_should_preempt,
+ .elevator_update_idle_window_fn = cfq_update_idle_window,
+ .elevator_close_cooperator_fn = cfq_close_cooperator,
},
+ .elevator_features = ELV_IOSCHED_NEED_FQ,
.elevator_attrs = cfq_attrs,
.elevator_name = "cfq",
.elevator_owner = THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
static int __init cfq_init(void)
{
- /*
- * could be 0 on HZ < 1000 setups
- */
- if (!cfq_slice_async)
- cfq_slice_async = 1;
- if (!cfq_slice_idle)
- cfq_slice_idle = 1;
-
if (cfq_slab_setup())
return -ENOMEM;
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (6 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (29 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This patch enables hierarchical fair queuing in the common layer. It is
controlled by config option CONFIG_GROUP_IOSCHED.
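
As a rough illustration of the interface this adds (an illustrative sketch
only: the mount point and group names below are made up, and the file names
simply follow the "io" subsystem and the weight/ioprio_class cftypes
introduced here, with weights in the 0-1000 range, a default of 500, and
ioprio_class between 1 (RT) and 3 (IDLE)), a group hierarchy could be
configured along these lines:

  # mount -t cgroup -o io none /cgroup
  # mkdir /cgroup/A /cgroup/B
  # echo 900 > /cgroup/A/io.weight
  # echo 100 > /cgroup/B/io.weight
  # mkdir /cgroup/A/A1                  # nested group, scheduled under A
  # echo 500 > /cgroup/A/A1/io.weight
  # echo <pid> > /cgroup/A/tasks        # move a task into group A

The elevator then builds the matching io_group chain for each request queue
on the first IO from a group (see io_find_alloc_group() below).
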
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-ioc.c | 3 +
block/elevator-fq.c | 1037 +++++++++++++++++++++++++++++++++++++----
block/elevator-fq.h | 149 ++++++-
block/elevator.c | 6 +
include/linux/blkdev.h | 7 +-
include/linux/cgroup_subsys.h | 7 +
include/linux/iocontext.h | 5 +
init/Kconfig | 8 +
8 files changed, 1127 insertions(+), 95 deletions(-)
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
spin_lock_init(&ret->lock);
ret->ioprio_changed = 0;
ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ret->cgroup_changed = 0;
+#endif
ret->last_waited = jiffies; /* doesn't matter... */
ret->nr_batch_requests = 0; /* because this is 0 */
ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
#define ELV_SLICE_SCALE (5)
#define ELV_HW_QUEUE_MIN (5)
+
+#define IO_DEFAULT_GRP_WEIGHT 500
+#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
}
/* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+ struct io_group *iog;
+ struct io_entity *entity, *next_active;
+
+ if (sd->active_entity != NULL)
+ /* will update/requeue at the end of service */
+ return 0;
+
+ /*
+ * NOTE: this can be improved in many ways, such as returning
+ * 1 (and thus propagating upwards the update) only when the
+ * budget changes, or caching the bfqq that will be scheduled
+ * next from this subtree. For now we worry more about
+ * correctness than about performance...
+ */
+ next_active = bfq_lookup_next_entity(sd, 0);
+ sd->next_active = next_active;
+
+ if (next_active != NULL) {
+ iog = container_of(sd, struct io_group, sched_data);
+ entity = iog->my_entity;
+ if (entity != NULL)
+ entity->budget = next_active->budget;
+ }
+
+ return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+ BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+ for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+ return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+}
+#endif
/*
* Shift for timestamp calculations. This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
bfq_update_active_tree(node);
}
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
- WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
- return IOPRIO_BE_NR - ioprio;
-}
-
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
elv_get_ioq(ioq);
}
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
- entity->ioprio = entity->new_ioprio;
- entity->ioprio_class = entity->new_ioprio_class;
- entity->sched_data = &iog->sched_data;
-}
-
/**
* bfq_find_deepest - find the deepest node that an extraction can modify.
* @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
struct io_queue *ioq = io_entity_to_ioq(entity);
if (entity->ioprio_changed) {
+ old_st->wsum -= entity->weight;
entity->ioprio = entity->new_ioprio;
entity->ioprio_class = entity->new_ioprio_class;
+ entity->weight = entity->new_weight;
entity->ioprio_changed = 0;
/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
entity->budget = elv_prio_to_slice(efqd, ioq);
}
- old_st->wsum -= entity->weight;
- entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
/*
* NOTE: here we may be changing the weight too early,
* this will cause unfairness. The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
if (add_front) {
struct io_entity *next_entity;
- /*
- * Determine the entity which will be dispatched next
- * Use sd->next_active once hierarchical patch is applied
- */
- next_entity = bfq_lookup_next_entity(sd, 0);
+ /* Determine the entity which will be dispatched next */
+ next_entity = sd->next_active;
if (next_entity && next_entity != entity) {
struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
}
/**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
* @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
*/
void bfq_activate_entity(struct io_entity *entity, int add_front)
{
- __bfq_activate_entity(entity, add_front);
+ struct io_sched_data *sd;
+
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, add_front);
+
+ add_front = 0;
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ /*
+ * No need to propagate the activation to the
+ * upper entities, as they will be updated when
+ * the active entity is rescheduled.
+ */
+ break;
+ }
}
/**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
else if (entity->tree != NULL)
BUG();
+ if (was_active || sd->next_active == entity)
+ ret = bfq_update_next_active(sd);
+
if (!requeue || !bfq_gt(entity->finish, st->vtime))
bfq_forget_entity(st, entity);
else
bfq_idle_insert(st, entity);
BUG_ON(sd->active_entity == entity);
+ BUG_ON(sd->next_active == entity);
return ret;
}
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
*/
void bfq_deactivate_entity(struct io_entity *entity, int requeue)
{
- __bfq_deactivate_entity(entity, requeue);
+ struct io_sched_data *sd;
+ struct io_entity *parent;
+
+ for_each_entity_safe(entity, parent) {
+ sd = entity->sched_data;
+
+ if (!__bfq_deactivate_entity(entity, requeue))
+ /*
+ * The parent entity is still backlogged, and
+ * we don't need to update it as it is still
+ * under service.
+ */
+ break;
+
+ if (sd->next_active != NULL)
+ /*
+ * The parent entity is still backlogged and
+ * the budgets on the path towards the root
+ * need to be updated.
+ */
+ goto update;
+
+ /*
+ * If we reach here the parent is no longer backlogged and
+ * we want to propagate the dequeue upwards.
+ */
+ requeue = 1;
+ }
+
+ return;
+
+update:
+ entity = parent;
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, 0);
+
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ break;
+ }
}
/**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
entity = __bfq_lookup_next_entity(st);
if (entity != NULL) {
if (extract) {
+ bfq_check_next_active(sd, entity);
bfq_active_extract(st, entity);
sd->active_entity = entity;
+ sd->next_active = NULL;
}
break;
}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
{
struct io_service_tree *st;
- st = io_entity_service_tree(entity);
- entity->service += served;
- BUG_ON(st->wsum == 0);
- st->vtime += bfq_delta(served, st->wsum);
- bfq_forget_idle(st);
+ for_each_entity(entity) {
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+ }
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+ .weight = IO_DEFAULT_GRP_WEIGHT,
+ .ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+ struct io_cgroup, css);
+}
+
+/*
+ * Search the hash table (for now only a list) of @iocg for the io_group
+ * associated with @key. Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ __key = rcu_dereference(iog->key);
+ if (__key == key)
+ return iog;
+ }
+
+ return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ struct cgroup *cgroup;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ cgroup = task_cgroup(current, io_subsys_id);
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, efqd);
+ return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct io_entity *entity = &iog->entity;
+
+ entity->weight = entity->new_weight = iocg->weight;
+ entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+ entity->ioprio_changed = 1;
+ entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+ struct io_entity *entity;
+
+ BUG_ON(parent == NULL);
+ BUG_ON(iog == NULL);
+
+ entity = &iog->entity;
+ entity->parent = parent->my_entity;
+ entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+ struct io_entity *entity = st->first_idle;
+
+ for (; entity != NULL; entity = st->first_idle)
+ __bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct io_cgroup *iocg; \
+ u64 ret; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ spin_lock_irq(&iocg->lock); \
+ ret = iocg->__VAR; \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return ret; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
+ struct cftype *cftype, \
+ u64 val) \
+{ \
+ struct io_cgroup *iocg; \
+ struct io_group *iog; \
+ struct hlist_node *n; \
+ \
+ if (val < (__MIN) || val > (__MAX)) \
+ return -EINVAL; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ \
+ spin_lock_irq(&iocg->lock); \
+ iocg->__VAR = (unsigned long)val; \
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ iog->entity.new_##__VAR = (unsigned long)val; \
+ smp_wmb(); \
+ iog->entity.ioprio_changed = 1; \
+ } \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return 0; \
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * io_group_chain_alloc - allocate a chain of io groups.
+ * @q: the request queue the chain is allocated for.
+ * @key: the elv_fq_data pointer used as lookup key.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
+ * to the root already has an allocated group for @key.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *leaf = NULL, *prev = NULL;
+ gfp_t flags = GFP_ATOMIC | __GFP_ZERO;
+
+ for (; cgroup != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL) {
+ /*
+ * All the cgroups in the path from here to the
+ * root must have an io_group for this key, so we don't
+ * need any more allocations.
+ */
+ break;
+ }
+
+ iog = kzalloc_node(sizeof(*iog), flags, q->node);
+ if (!iog)
+ goto cleanup;
+
+ io_group_init_entity(iocg, iog);
+ iog->my_entity = &iog->entity;
+
+ if (leaf == NULL) {
+ leaf = iog;
+ prev = leaf;
+ } else {
+ io_group_set_parent(prev, iog);
+ /*
+ * Build a list of allocated nodes using the bfqd
+ * filed, that is still unused and will be initialized
+ * only after the node will be connected.
+ */
+ prev->key = iog;
+ prev = iog;
+ }
+ }
+
+ return leaf;
+
+cleanup:
+ while (leaf != NULL) {
+ prev = leaf;
+ leaf = leaf->key;
+ kfree(prev);
+ }
+
+ return NULL;
+}
+
+/**
+ * io_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @q: the request queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+ struct cgroup *cgroup,
+ struct io_group *leaf,
+ struct elv_fq_data *efqd)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *next, *prev = NULL;
+ unsigned long flags;
+
+ assert_spin_locked(q->queue_lock);
+
+ for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ next = leaf->key;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ BUG_ON(iog != NULL);
+
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ rcu_assign_pointer(leaf->key, key);
+ hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+ spin_unlock_irqrestore(&iocg->lock, flags);
+
+ prev = leaf;
+ leaf = next;
+ }
+
+ BUG_ON(cgroup == NULL && leaf != NULL);
+
+ if (cgroup != NULL && prev != NULL) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, key);
+ io_group_set_parent(prev, iog);
+ }
+}
+
+/**
+ * io_find_alloc_group - return the group associated to @efqd in @cgroup.
+ * @efqd: the elevator fair queuing data, used as the lookup key.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @efqd in @cgroup, allocating one if
+ * necessary. When a group is returned all the cgroups in the path
+ * to the root have a group associated to @efqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback. If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+ struct cgroup *cgroup, struct elv_fq_data *efqd,
+ int create)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct io_group *iog = NULL;
+ /* Note: Use efqd as key */
+ void *key = efqd;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL || !create)
+ return iog;
+
+ iog = io_group_chain_alloc(q, key, cgroup);
+ if (iog != NULL)
+ io_group_chain_link(q, key, cgroup, iog, efqd);
+
+ return iog;
+}
+
+/*
+ * Search for the io group the current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ struct cgroup *cgroup;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, io_subsys_id);
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ /*
+ * bio merge functions doing lookup don't want to
+ * map bio to root group by default
+ */
+ iog = NULL;
+ }
+ rcu_read_unlock();
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_cgroup *iocg = &io_root_cgroup;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_group *iog = efqd->root_group;
+
+ BUG_ON(!iog);
+ spin_lock_irq(&iocg->lock);
+ hlist_del_rcu(&iog->group_node);
+ spin_unlock_irq(&iocg->lock);
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ iog->entity.parent = NULL;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ iocg = &io_root_cgroup;
+ spin_lock_irq(&iocg->lock);
+ rcu_assign_pointer(iog->key, key);
+ hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+ spin_unlock_irq(&iocg->lock);
+
+ return iog;
+}
+
+struct cftype bfqio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = io_cgroup_weight_read,
+ .write_u64 = io_cgroup_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = io_cgroup_ioprio_class_read,
+ .write_u64 = io_cgroup_ioprio_class_write,
+ },
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, bfqio_files,
+ ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+
+ if (cgroup->parent != NULL) {
+ iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+ if (iocg == NULL)
+ return ERR_PTR(-ENOMEM);
+ } else
+ iocg = &io_root_cgroup;
+
+ spin_lock_init(&iocg->lock);
+ INIT_HLIST_HEAD(&iocg->group_data);
+ iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+ iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+ return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures. For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct task_struct *tsk)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+ /*
+ * ioc == NULL means that the task is either too young or
+ * exiting: if it still has no ioc the ioc can't be shared,
+ * if the task is exiting the attach will fail anyway, no
+ * matter what we return here.
+ */
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all of its IO is not done yet. This is not
+ * a very good scheme as a user might get an unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+ int busy, resume;
+ struct io_entity *entity = &ioq->entity;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ busy = elv_ioq_busy(ioq);
+ resume = !!ioq->nr_queued;
+
+ BUG_ON(resume && !entity->on_st);
+ BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+ /*
+ * We could be moving a queue which is on the idle tree of the previous
+ * group. What to do? I guess anyway this queue does not have any requests.
+ * Just forget the entity and free it up from the idle tree.
+ *
+ * This needs cleanup. Hackish.
+ */
+ if (entity->tree == &st->idle) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+ bfq_put_idle_entity(st, entity);
+ }
+
+ if (busy) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+
+ if (!resume)
+ elv_del_ioq_busy(e, ioq, 0);
+ else
+ elv_deactivate_ioq(efqd, ioq, 0);
+ }
+
+ /*
+ * Here we use a reference to the io group. We don't need a refcounter
+ * as the cgroup reference will not be dropped, so that its
+ * destroy() callback will not be invoked.
+ */
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+
+ if (busy && resume)
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+ struct elevator_queue *eq;
+ struct io_entity *entity = iog->my_entity;
+ struct io_service_tree *st;
+ int i;
+
+ eq = container_of(efqd, struct elevator_queue, efqd);
+ hlist_del(&iog->elv_data_node);
+ __bfq_deactivate_entity(entity, 0);
+ io_put_io_group_queues(eq, iog);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ /*
+ * The idle tree may still contain bfq_queues belonging
+ * to exited tasks because they never migrated to a different
+ * cgroup from the one being destroyed now. No one else
+ * can access them so it's safe to act without any lock.
+ */
+ io_flush_idle_tree(st);
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * io_destroy_group - destroy @iog.
+ * @iocg: the io_cgroup containing @iog.
+ * @iog: the group being destroyed.
+ *
+ * Destroy @iog, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct elv_fq_data *efqd = NULL;
+ unsigned long uninitialized_var(flags);
+
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
+
+ /*
+ * io groups are linked in two lists. One list is maintained
+ * in the elevator (efqd->group_list) and the other is maintained
+ * per cgroup structure (iocg->group_data).
+ *
+ * While a cgroup is being deleted, the elevator might also be
+ * exiting and both might try to clean up the same io group,
+ * so we need to be a little careful.
+ *
+ * The following code first accesses efqd under RCU to make sure
+ * iog->key is pointing to a valid efqd and then takes the
+ * associated queue lock. After getting the queue lock it
+ * again checks whether the elevator exit path has already got
+ * hold of io group (iog->key == NULL). If yes, it does not
+ * try to free up async queues again or flush the idle tree.
+ */
+
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ spin_lock_irqsave(efqd->queue->queue_lock, flags);
+ if (iog->key == efqd)
+ __io_destroy_group(efqd, iog);
+ spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+ }
+ rcu_read_unlock();
+
+ /*
+ * No need to defer the kfree() to the end of the RCU grace
+ * period: we are called from the destroy() callback of our
+ * cgroup, so we can be sure that no one is a) still using
+ * this cgroup or b) doing lookups in it.
+ */
+ kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+
+ /*
+ * Since we are destroying the cgroup, there are no more tasks
+ * referencing it, and all the RCU grace periods that may have
+ * referenced it are ended (as the destruction of the parent
+ * cgroup is RCU-safe); iocg->group_data will not be accessed by
+ * anything else and we don't need any synchronization.
+ */
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+ io_destroy_group(iocg, iog);
+
+ BUG_ON(!hlist_empty(&iocg->group_data));
+
+ kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+ struct hlist_node *pos, *n;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+ elv_data_node) {
+ hlist_del(&iog->elv_data_node);
+
+ __bfq_deactivate_entity(iog->my_entity, 0);
+
+ /*
+ * Don't remove from the group hash, just set an
+ * invalid key. No lookups can race with the
+ * assignment as bfqd is being destroyed; this
+ * implies also that new elements cannot be added
+ * to the list.
+ */
+ rcu_assign_pointer(iog->key, NULL);
+ io_put_io_group_queues(e, iog);
+ }
+}
+
+struct cgroup_subsys io_subsys = {
+ .name = "io",
+ .create = iocg_create,
+ .can_attach = iocg_can_attach,
+ .attach = iocg_attach,
+ .destroy = iocg_destroy,
+ .populate = iocg_populate,
+ .subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio submitting task and the rq don't belong to the same io_group,
+ * they can't be merged
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ struct request_queue *q = rq->q;
+ struct io_queue *ioq = rq->ioq;
+ struct io_group *iog, *__iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return 1;
+
+ /* Determine the io group of the bio submitting task */
+ iog = io_get_io_group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a different cgroup for which the io
+ * group has not been set up yet. */
+ return 0;
+ }
+
+ /* Determine the io group of the ioq the rq belongs to */
+ __iog = ioq_to_io_group(ioq);
+
+ return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+ struct io_group *iog;
+ unsigned long flags;
+
+ /* Make sure the io group hierarchy has been set up and also set the
+ * io group to which rq belongs. Later we should make use of the
+ * bio cgroup patches to determine the io group. */
+ spin_lock_irqsave(q->queue_lock, flags);
+ iog = io_get_io_group(q, 1);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ BUG_ON(!iog);
+
+ /* Store iog in rq. TODO: take care of referencing */
+ rq->iog = iog;
}
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
/* Elevator fair queuing function */
struct io_queue *rq_ioq(struct request *rq)
{
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
{
+ struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
+ io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
return NULL;
sd = &efqd->root_group->sched_data;
- if (extract)
- entity = bfq_lookup_next_entity(sd, 1);
- else
- entity = bfq_lookup_next_entity(sd, 0);
+ for (; sd != NULL; sd = entity->my_sched_data) {
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ /*
+ * entity can be NULL despite the fact that there are busy
+ * queues, if all the busy queues are under a group which is
+ * currently under service.
+ * So if we are just looking for the next ioq while something is
+ * being served, a NULL entity is not an error.
+ */
+ BUG_ON(!entity && extract);
+
+ if (extract)
+ entity->service = 0;
- BUG_ON(!entity);
- if (extract)
- entity->service = 0;
+ if (!entity)
+ return NULL;
+ }
ioq = io_entity_to_ioq(entity);
return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
struct request_queue *q = efqd->queue;
if (ioq) {
- elv_log_ioq(efqd, ioq, "set_active, busy=%d",
- efqd->busy_queues);
+ struct io_group *iog = ioq_to_io_group(ioq);
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+ " weight=%ld group_weight=%ld",
+ efqd->busy_queues,
+ ioq->entity.ioprio, ioq->entity.weight,
+ iog_weight(iog));
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
{
struct io_queue *ioq;
struct elevator_queue *eq = q->elevator;
+ struct io_group *iog = NULL, *new_iog = NULL;
ioq = elv_active_ioq(eq);
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
/*
* Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ *
+ * TODO: In a hierarchical setup, one needs to traverse up the hierarchy
+ * till both the queues are children of the same parent to make a
+ * decision whether to do the preemption or not. Something like
+ * what cfs has done for the cpu scheduler. Will do it a little later.
*/
if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
return 1;
+ iog = ioq_to_io_group(ioq);
+ new_iog = ioq_to_io_group(new_ioq);
+
/*
- * Check with io scheduler if it has additional criterion based on
- * which it wants to preempt existing queue.
+ * If both the queues belong to the same group, check with the io scheduler
+ * if it has an additional criterion based on which it wants to
+ * preempt the existing queue.
*/
+ if (iog != new_iog)
+ return 0;
+
if (eq->ops->elevator_should_preempt_fn)
return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_schedule_dispatch(q);
}
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
- struct elv_fq_data *efqd = &q->elevator->efqd;
-
- return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
int ioprio)
{
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
}
EXPORT_SYMBOL(io_group_set_async_queue);
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
- int i, j;
-
- for (i = 0; i < 2; i++)
- for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
-
- /* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
- struct elevator_queue *e, void *key)
-{
- struct io_group *iog;
- int i;
-
- iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
- if (iog == NULL)
- return NULL;
-
- for (i = 0; i < IO_IOPRIO_CLASSES; i++)
- iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
- return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
- struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
- kfree(iog);
-}
-
static void elv_slab_kill(void)
{
/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
INIT_LIST_HEAD(&efqd->idle_list);
+ INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
spin_lock_irq(q->queue_lock);
/* This should drop all the idle tree references of ioq */
elv_free_idle_ioq_list(e);
+ /* This should drop all the io group references of async queues */
+ io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
elv_shutdown_timer_wq(e);
+ /* Wait for iog->key accessors to exit their grace periods. */
+ synchronize_rcu();
BUG_ON(timer_pending(&efqd->idle_slice_timer));
io_free_root_group(e);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
*/
#include <linux/blkdev.h>
+#include <linux/cgroup.h>
#ifndef _BFQ_SCHED_H
#define _BFQ_SCHED_H
#define IO_IOPRIO_CLASSES 3
+#define WEIGHT_MAX 1000
typedef u64 bfq_timestamp_t;
typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
*/
struct io_sched_data {
struct io_entity *active_entity;
+ struct io_entity *next_active;
struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
};
@@ -84,13 +87,12 @@ struct io_sched_data {
* this entity; used for O(log N) lookups into active trees.
* @service: service received during the last round of service.
* @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
* @parent: parent entity, for hierarchical scheduling.
* @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
* associated scheduler queue, %NULL on leaf nodes.
* @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
* @ioprio_class: the ioprio_class in use.
* @new_ioprio_class: when an ioprio_class change is requested, the new
* ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
bfq_timestamp_t min_start;
bfq_service_t service, budget;
- bfq_weight_t weight;
struct io_entity *parent;
struct io_sched_data *my_sched_data;
struct io_sched_data *sched_data;
+ bfq_weight_t weight, new_weight;
unsigned short ioprio, new_ioprio;
unsigned short ioprio_class, new_ioprio_class;
@@ -180,6 +182,75 @@ struct io_queue {
void *sched_queue;
};
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct io_group - per (request queue, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ * both io queues and io groups).
+ * @group_node: node to be inserted into the io_cgroup->group_data
+ * list of the containing cgroup's io_cgroup.
+ * @elv_data_node: node to be inserted into the group_list of the
+ * elv_fq_data of the groups active on the same device; used for cleanup.
+ * @key: the elv_fq_data for the device this group acts upon.
+ * @async_queue: array of async queues for all the tasks belonging to
+ * the group, one queue per ioprio value per ioprio_class,
+ * except for the idle class that has only one queue.
+ * @async_idle_queue: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ * to avoid too many special cases during group creation/migration.
+ *
+ * Each (request queue, cgroup) pair has its own io_group, i.e., for each
+ * cgroup there is a set of io_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same queue.
+ *
+ * Locking works as follows:
+ * o @group_node is protected by the io_cgroup lock, and is accessed
+ * via RCU from its readers.
+ * o @key is protected by the queue lock, RCU is used to access it
+ * from the readers.
+ * o All the other fields are protected by the queue lock.
+ */
+struct io_group {
+ struct io_entity entity;
+ struct hlist_node elv_data_node;
+ struct hlist_node group_node;
+ struct io_sched_data sched_data;
+
+ struct io_entity *my_entity;
+
+ /*
+ * A cgroup has multiple io_groups, one for each request queue.
+ * to find io group belonging to a particular queue, elv_fq_data
+ * pointer is stored as a key.
+ */
+ void *key;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct io_cgroup - io controller cgroup data structure.
+ * @css: subsystem state for the io controller in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the io_groups belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+ struct cgroup_subsys_state css;
+
+ unsigned long weight, ioprio_class;
+
+ spinlock_t lock;
+ struct hlist_head group_data;
+};
+#else
struct io_group {
struct io_sched_data sched_data;
@@ -187,10 +258,14 @@ struct io_group {
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
};
+#endif
struct elv_fq_data {
struct io_group *root_group;
+ /* List of io groups hanging on this elevator */
+ struct hlist_head group_list;
+
/* List of io queues on idle tree. */
struct list_head idle_list;
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
ioq->entity.ioprio_changed = 1;
}
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
{
ioq->entity.new_ioprio = ioprio;
+ ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
ioq->entity.ioprio_changed = 1;
}
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
sched_data);
}
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of a flat setup. The root io group gets
+ * cleaned up upon elevator exit and before that it has been made sure that
+ * both the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence in case of a non-hierarchical setup, nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ /* Just root group is present and weight is immaterial. */
+ return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
/* Functions used by blksysfs.c */
extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
{
return NULL;
}
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+ return 1;
+}
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
if (bio_integrity(bio) != blk_integrity_rq(rq))
return 0;
+ /* If rq and bio belong to different groups, don't allow merging */
+ if (!io_group_allow_merge(rq, bio))
+ return 0;
+
if (!elv_iosched_allow_merge(rq, bio))
return 0;
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_set_request_io_group(q, rq);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
#ifdef CONFIG_ELV_FAIR_QUEUING
/* io queue request belongs to */
struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* io group request belongs to */
+ struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
#endif
/* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;
+#ifdef CONFIG_GROUP_IOSCHED
+ /* If task changes the cgroup, elevator processes it asynchronously */
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.
+config GROUP_IOSCHED
+ bool "Group IO Scheduler"
+ depends on CGROUPS && ELV_FAIR_QUEUING
+ default n
+ ---help---
+ This feature lets the IO scheduler recognize task groups and control
+ disk bandwidth allocation to such task groups.
+
endif # CGROUPS
config MM_OWNER
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (7 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-07 7:42 ` Gui Jianfeng
` (2 more replies)
2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
` (28 subsequent siblings)
37 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This patch enables hierarchical fair queuing in common layer. It is
controlled by config option CONFIG_GROUP_IOSCHED.
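One user-visible detail worth noting: per-queue weights are no longer the
raw "IOPRIO_BE_NR - ioprio" values but are rescaled onto the same
0..WEIGHT_MAX range used for group weights. A quick user-space sketch of
the resulting bfq_ioprio_to_weight() mapping (illustrative only, not part
of the kernel patch itself; it simply assumes WEIGHT_MAX = 1000 and
IOPRIO_BE_NR = 8 as used in this series):

#include <stdio.h>

#define IOPRIO_BE_NR	8
#define WEIGHT_MAX	1000

/* mirrors bfq_ioprio_to_weight() from elevator-fq.h */
static unsigned long ioprio_to_weight(int ioprio)
{
	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX) / IOPRIO_BE_NR;
}

int main(void)
{
	int prio;

	for (prio = 0; prio < IOPRIO_BE_NR; prio++)
		printf("ioprio %d -> weight %lu\n", prio, ioprio_to_weight(prio));
	/* prints 1000, 875, 750, 625, 500, 375, 250, 125 */
	return 0;
}

So the default BE ioprio of 4 maps to weight 500, which matches the
IO_DEFAULT_GRP_WEIGHT assigned to newly created cgroups.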
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-ioc.c | 3 +
block/elevator-fq.c | 1037 +++++++++++++++++++++++++++++++++++++----
block/elevator-fq.h | 149 ++++++-
block/elevator.c | 6 +
include/linux/blkdev.h | 7 +-
include/linux/cgroup_subsys.h | 7 +
include/linux/iocontext.h | 5 +
init/Kconfig | 8 +
8 files changed, 1127 insertions(+), 95 deletions(-)
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
spin_lock_init(&ret->lock);
ret->ioprio_changed = 0;
ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ret->cgroup_changed = 0;
+#endif
ret->last_waited = jiffies; /* doesn't matter... */
ret->nr_batch_requests = 0; /* because this is 0 */
ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
#define ELV_SLICE_SCALE (5)
#define ELV_HW_QUEUE_MIN (5)
+
+#define IO_DEFAULT_GRP_WEIGHT 500
+#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
}
/* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+ struct io_group *iog;
+ struct io_entity *entity, *next_active;
+
+ if (sd->active_entity != NULL)
+ /* will update/requeue at the end of service */
+ return 0;
+
+ /*
+ * NOTE: this can be improved in many ways, such as returning
+ * 1 (and thus propagating upwards the update) only when the
+ * budget changes, or caching the bfqq that will be scheduled
+ * next from this subtree. For now we worry more about
+ * correctness than about performance...
+ */
+ next_active = bfq_lookup_next_entity(sd, 0);
+ sd->next_active = next_active;
+
+ if (next_active != NULL) {
+ iog = container_of(sd, struct io_group, sched_data);
+ entity = iog->my_entity;
+ if (entity != NULL)
+ entity->budget = next_active->budget;
+ }
+
+ return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+ BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+ for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+ return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+}
+#endif
/*
* Shift for timestamp calculations. This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
bfq_update_active_tree(node);
}
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
- WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
- return IOPRIO_BE_NR - ioprio;
-}
-
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
elv_get_ioq(ioq);
}
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
- entity->ioprio = entity->new_ioprio;
- entity->ioprio_class = entity->new_ioprio_class;
- entity->sched_data = &iog->sched_data;
-}
-
/**
* bfq_find_deepest - find the deepest node that an extraction can modify.
* @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
struct io_queue *ioq = io_entity_to_ioq(entity);
if (entity->ioprio_changed) {
+ old_st->wsum -= entity->weight;
entity->ioprio = entity->new_ioprio;
entity->ioprio_class = entity->new_ioprio_class;
+ entity->weight = entity->new_weight;
entity->ioprio_changed = 0;
/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
entity->budget = elv_prio_to_slice(efqd, ioq);
}
- old_st->wsum -= entity->weight;
- entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
/*
* NOTE: here we may be changing the weight too early,
* this will cause unfairness. The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
if (add_front) {
struct io_entity *next_entity;
- /*
- * Determine the entity which will be dispatched next
- * Use sd->next_active once hierarchical patch is applied
- */
- next_entity = bfq_lookup_next_entity(sd, 0);
+ /* Determine the entity which will be dispatched next */
+ next_entity = sd->next_active;
if (next_entity && next_entity != entity) {
struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
}
/**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
* @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
*/
void bfq_activate_entity(struct io_entity *entity, int add_front)
{
- __bfq_activate_entity(entity, add_front);
+ struct io_sched_data *sd;
+
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, add_front);
+
+ add_front = 0;
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ /*
+ * No need to propagate the activation to the
+ * upper entities, as they will be updated when
+ * the active entity is rescheduled.
+ */
+ break;
+ }
}
/**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
else if (entity->tree != NULL)
BUG();
+ if (was_active || sd->next_active == entity)
+ ret = bfq_update_next_active(sd);
+
if (!requeue || !bfq_gt(entity->finish, st->vtime))
bfq_forget_entity(st, entity);
else
bfq_idle_insert(st, entity);
BUG_ON(sd->active_entity == entity);
+ BUG_ON(sd->next_active == entity);
return ret;
}
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
*/
void bfq_deactivate_entity(struct io_entity *entity, int requeue)
{
- __bfq_deactivate_entity(entity, requeue);
+ struct io_sched_data *sd;
+ struct io_entity *parent;
+
+ for_each_entity_safe(entity, parent) {
+ sd = entity->sched_data;
+
+ if (!__bfq_deactivate_entity(entity, requeue))
+ /*
+ * The parent entity is still backlogged, and
+ * we don't need to update it as it is still
+ * under service.
+ */
+ break;
+
+ if (sd->next_active != NULL)
+ /*
+ * The parent entity is still backlogged and
+ * the budgets on the path towards the root
+ * need to be updated.
+ */
+ goto update;
+
+ /*
+ * If we reach there the parent is no more backlogged and
+ * we want to propagate the dequeue upwards.
+ */
+ requeue = 1;
+ }
+
+ return;
+
+update:
+ entity = parent;
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, 0);
+
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ break;
+ }
}
/**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
entity = __bfq_lookup_next_entity(st);
if (entity != NULL) {
if (extract) {
+ bfq_check_next_active(sd, entity);
bfq_active_extract(st, entity);
sd->active_entity = entity;
+ sd->next_active = NULL;
}
break;
}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
{
struct io_service_tree *st;
- st = io_entity_service_tree(entity);
- entity->service += served;
- BUG_ON(st->wsum == 0);
- st->vtime += bfq_delta(served, st->wsum);
- bfq_forget_idle(st);
+ for_each_entity(entity) {
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+ }
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+ .weight = IO_DEFAULT_GRP_WEIGHT,
+ .ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+ struct io_cgroup, css);
+}
+
+/*
+ * Search the bfq_group for bfqd in the hash table (for now only a list)
+ * of bgrp. Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ __key = rcu_dereference(iog->key);
+ if (__key == key)
+ return iog;
+ }
+
+ return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ struct cgroup *cgroup;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ cgroup = task_cgroup(current, io_subsys_id);
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, efqd);
+ return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct io_entity *entity = &iog->entity;
+
+ entity->weight = entity->new_weight = iocg->weight;
+ entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+ entity->ioprio_changed = 1;
+ entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+ struct io_entity *entity;
+
+ BUG_ON(parent == NULL);
+ BUG_ON(iog == NULL);
+
+ entity = &iog->entity;
+ entity->parent = parent->my_entity;
+ entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+ struct io_entity *entity = st->first_idle;
+
+ for (; entity != NULL; entity = st->first_idle)
+ __bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct io_cgroup *iocg; \
+ u64 ret; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ spin_lock_irq(&iocg->lock); \
+ ret = iocg->__VAR; \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return ret; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
+ struct cftype *cftype, \
+ u64 val) \
+{ \
+ struct io_cgroup *iocg; \
+ struct io_group *iog; \
+ struct hlist_node *n; \
+ \
+ if (val < (__MIN) || val > (__MAX)) \
+ return -EINVAL; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ \
+ spin_lock_irq(&iocg->lock); \
+ iocg->__VAR = (unsigned long)val; \
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ iog->entity.new_##__VAR = (unsigned long)val; \
+ smp_wmb(); \
+ iog->entity.ioprio_changed = 1; \
+ } \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return 0; \
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
+ * to the root has already an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *leaf = NULL, *prev = NULL;
+ gfp_t flags = GFP_ATOMIC | __GFP_ZERO;
+
+ for (; cgroup != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL) {
+ /*
+ * All the cgroups in the path from there to the
+ * root must have a bfq_group for bfqd, so we don't
+ * need any more allocations.
+ */
+ break;
+ }
+
+ iog = kzalloc_node(sizeof(*iog), flags, q->node);
+ if (!iog)
+ goto cleanup;
+
+ io_group_init_entity(iocg, iog);
+ iog->my_entity = &iog->entity;
+
+ if (leaf == NULL) {
+ leaf = iog;
+ prev = leaf;
+ } else {
+ io_group_set_parent(prev, iog);
+ /*
+ * Build a list of allocated nodes using the bfqd
+ * field, which is still unused and will be initialized
+ * only after the node is connected.
+ */
+ prev->key = iog;
+ prev = iog;
+ }
+ }
+
+ return leaf;
+
+cleanup:
+ while (leaf != NULL) {
+ prev = leaf;
+ leaf = leaf->key;
+ kfree(prev);
+ }
+
+ return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+ struct cgroup *cgroup,
+ struct io_group *leaf,
+ struct elv_fq_data *efqd)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *next, *prev = NULL;
+ unsigned long flags;
+
+ assert_spin_locked(q->queue_lock);
+
+ for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ next = leaf->key;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ BUG_ON(iog != NULL);
+
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ rcu_assign_pointer(leaf->key, key);
+ hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+ spin_unlock_irqrestore(&iocg->lock, flags);
+
+ prev = leaf;
+ leaf = next;
+ }
+
+ BUG_ON(cgroup == NULL && leaf != NULL);
+
+ if (cgroup != NULL && prev != NULL) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, key);
+ io_group_set_parent(prev, iog);
+ }
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary. When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback. If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+ struct cgroup *cgroup, struct elv_fq_data *efqd,
+ int create)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct io_group *iog = NULL;
+ /* Note: Use efqd as key */
+ void *key = efqd;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL || !create)
+ return iog;
+
+ iog = io_group_chain_alloc(q, key, cgroup);
+ if (iog != NULL)
+ io_group_chain_link(q, key, cgroup, iog, efqd);
+
+ return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ struct cgroup *cgroup;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, io_subsys_id);
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ /*
+ * bio merge functions doing lookup don't want to
+ * map bio to root group by default
+ */
+ iog = NULL;
+ }
+ rcu_read_unlock();
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_cgroup *iocg = &io_root_cgroup;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_group *iog = efqd->root_group;
+
+ BUG_ON(!iog);
+ spin_lock_irq(&iocg->lock);
+ hlist_del_rcu(&iog->group_node);
+ spin_unlock_irq(&iocg->lock);
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ iog->entity.parent = NULL;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ iocg = &io_root_cgroup;
+ spin_lock_irq(&iocg->lock);
+ rcu_assign_pointer(iog->key, key);
+ hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+ spin_unlock_irq(&iocg->lock);
+
+ return iog;
+}
+
+struct cftype bfqio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = io_cgroup_weight_read,
+ .write_u64 = io_cgroup_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = io_cgroup_ioprio_class_read,
+ .write_u64 = io_cgroup_ioprio_class_write,
+ },
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, bfqio_files,
+ ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+
+ if (cgroup->parent != NULL) {
+ iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+ if (iocg == NULL)
+ return ERR_PTR(-ENOMEM);
+ } else
+ iocg = &io_root_cgroup;
+
+ spin_lock_init(&iocg->lock);
+ INIT_HLIST_HEAD(&iocg->group_data);
+ iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+ iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+ return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures. By now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct task_struct *tsk)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+ /*
+ * ioc == NULL means that the task is either too young or
+ * exiting: if it has still no ioc the ioc can't be shared,
+ * if the task is exiting the attach will fail anyway, no
+ * matter what we return here.
+ */
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not
+ * a very good scheme, as a user might get an unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+ int busy, resume;
+ struct io_entity *entity = &ioq->entity;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ busy = elv_ioq_busy(ioq);
+ resume = !!ioq->nr_queued;
+
+ BUG_ON(resume && !entity->on_st);
+ BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+ /*
+ * We could be moving a queue which is on the idle tree of the previous
+ * group. What to do? This queue presumably does not have any requests,
+ * so just forget the entity and free it up from the idle tree.
+ *
+ * This needs cleanup. Hackish.
+ */
+ if (entity->tree == &st->idle) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+ bfq_put_idle_entity(st, entity);
+ }
+
+ if (busy) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+
+ if (!resume)
+ elv_del_ioq_busy(e, ioq, 0);
+ else
+ elv_deactivate_ioq(efqd, ioq, 0);
+ }
+
+ /*
+ * Here we use a reference to bfqg. We don't need a refcounter
+ * as the cgroup reference will not be dropped, so that its
+ * destroy() callback will not be invoked.
+ */
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+
+ if (busy && resume)
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+ struct elevator_queue *eq;
+ struct io_entity *entity = iog->my_entity;
+ struct io_service_tree *st;
+ int i;
+
+ eq = container_of(efqd, struct elevator_queue, efqd);
+ hlist_del(&iog->elv_data_node);
+ __bfq_deactivate_entity(entity, 0);
+ io_put_io_group_queues(eq, iog);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ /*
+ * The idle tree may still contain bfq_queues belonging
+ * to exited tasks because they never migrated to a different
+ * cgroup from the one being destroyed now. No one else
+ * can access them, so it's safe to act without any lock.
+ */
+ io_flush_idle_tree(st);
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct elv_fq_data *efqd = NULL;
+ unsigned long uninitialized_var(flags);
+
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
+
+ /*
+ * io groups are linked in two lists. One list is maintained
+ * in the elevator (efqd->group_list) and the other is maintained
+ * per cgroup structure (iocg->group_data).
+ *
+ * While a cgroup is being deleted, the elevator might also be
+ * exiting, and both might try to clean up the same io group,
+ * so we need to be a little careful.
+ *
+ * Following code first accesses efqd under RCU to make sure
+ * iog->key is pointing to valid efqd and then takes the
+ * associated queue lock. After getting the queue lock it
+ * again checks whether the elevator exit path has already got
+ * hold of io group (iog->key == NULL). If yes, it does not
+ * try to free up async queues again or flush the idle tree.
+ */
+
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ spin_lock_irqsave(efqd->queue->queue_lock, flags);
+ if (iog->key == efqd)
+ __io_destroy_group(efqd, iog);
+ spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+ }
+ rcu_read_unlock();
+
+ /*
+ * No need to defer the kfree() to the end of the RCU grace
+ * period: we are called from the destroy() callback of our
+ * cgroup, so we can be sure that no one is a) still using
+ * this cgroup or b) doing lookups in it.
+ */
+ kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+
+ /*
+ * Since we are destroying the cgroup, there are no more tasks
+ * referencing it, and all the RCU grace periods that may have
+ * referenced it are ended (as the destruction of the parent
+ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+ * anything else and we don't need any synchronization.
+ */
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+ io_destroy_group(iocg, iog);
+
+ BUG_ON(!hlist_empty(&iocg->group_data));
+
+ kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+ struct hlist_node *pos, *n;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+ elv_data_node) {
+ hlist_del(&iog->elv_data_node);
+
+ __bfq_deactivate_entity(iog->my_entity, 0);
+
+ /*
+ * Don't remove from the group hash, just set an
+ * invalid key. No lookups can race with the
+ * assignment as bfqd is being destroyed; this
+ * implies also that new elements cannot be added
+ * to the list.
+ */
+ rcu_assign_pointer(iog->key, NULL);
+ io_put_io_group_queues(e, iog);
+ }
+}
+
+struct cgroup_subsys io_subsys = {
+ .name = "io",
+ .create = iocg_create,
+ .can_attach = iocg_can_attach,
+ .attach = iocg_attach,
+ .destroy = iocg_destroy,
+ .populate = iocg_populate,
+ .subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio submitting task and rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ struct request_queue *q = rq->q;
+ struct io_queue *ioq = rq->ioq;
+ struct io_group *iog, *__iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return 1;
+
+ /* Determine the io group of the bio submitting task */
+ iog = io_get_io_group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a different cgroup for which the
+ * io group has not been set up yet. */
+ return 0;
+ }
+
+ /* Determine the io group of the ioq, rq belongs to*/
+ __iog = ioq_to_io_group(ioq);
+
+ return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+ struct io_group *iog;
+ unsigned long flags;
+
+ /* Make sure the io group hierarchy has been set up and also set the
+ * io group to which rq belongs. Later we should make use of
+ * bio cgroup patches to determine the io group */
+ spin_lock_irqsave(q->queue_lock, flags);
+ iog = io_get_io_group(q, 1);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ BUG_ON(!iog);
+
+ /* Store iog in rq. TODO: take care of referencing */
+ rq->iog = iog;
}
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
/* Elevator fair queuing function */
struct io_queue *rq_ioq(struct request *rq)
{
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
{
+ struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
+ io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
return NULL;
sd = &efqd->root_group->sched_data;
- if (extract)
- entity = bfq_lookup_next_entity(sd, 1);
- else
- entity = bfq_lookup_next_entity(sd, 0);
+ for (; sd != NULL; sd = entity->my_sched_data) {
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ /*
+ * The entity can be NULL despite the fact that there are busy
+ * queues, if all the busy queues are under a group which is
+ * currently under service.
+ * So if we are just looking for the next ioq while something is
+ * being served, a NULL entity is not an error.
+ */
+ BUG_ON(!entity && extract);
+
+ if (extract)
+ entity->service = 0;
- BUG_ON(!entity);
- if (extract)
- entity->service = 0;
+ if (!entity)
+ return NULL;
+ }
ioq = io_entity_to_ioq(entity);
return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
struct request_queue *q = efqd->queue;
if (ioq) {
- elv_log_ioq(efqd, ioq, "set_active, busy=%d",
- efqd->busy_queues);
+ struct io_group *iog = ioq_to_io_group(ioq);
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+ " weight=%ld group_weight=%ld",
+ efqd->busy_queues,
+ ioq->entity.ioprio, ioq->entity.weight,
+ iog_weight(iog));
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
{
struct io_queue *ioq;
struct elevator_queue *eq = q->elevator;
+ struct io_group *iog = NULL, *new_iog = NULL;
ioq = elv_active_ioq(eq);
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
/*
* Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ *
+ * TODO: In hierarchical setup, one needs to traverse up the hierarchy
+ * till both the queues are children of the same parent to make a
+ * decision whether to do the preemption or not. Something like
+ * what cfs has done for the cpu scheduler. Will do it a little later.
*/
if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
return 1;
+ iog = ioq_to_io_group(ioq);
+ new_iog = ioq_to_io_group(new_ioq);
+
/*
- * Check with io scheduler if it has additional criterion based on
- * which it wants to preempt existing queue.
+ * If both the queues belong to same group, check with io scheduler
+ * if it has additional criterion based on which it wants to
+ * preempt existing queue.
*/
+ if (iog != new_iog)
+ return 0;
+
if (eq->ops->elevator_should_preempt_fn)
return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_schedule_dispatch(q);
}
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
- struct elv_fq_data *efqd = &q->elevator->efqd;
-
- return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
int ioprio)
{
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
}
EXPORT_SYMBOL(io_group_set_async_queue);
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
- int i, j;
-
- for (i = 0; i < 2; i++)
- for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
-
- /* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
- struct elevator_queue *e, void *key)
-{
- struct io_group *iog;
- int i;
-
- iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
- if (iog == NULL)
- return NULL;
-
- for (i = 0; i < IO_IOPRIO_CLASSES; i++)
- iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
- return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
- struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
- kfree(iog);
-}
-
static void elv_slab_kill(void)
{
/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
INIT_LIST_HEAD(&efqd->idle_list);
+ INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
spin_lock_irq(q->queue_lock);
/* This should drop all the idle tree references of ioq */
elv_free_idle_ioq_list(e);
+ /* This should drop all the io group references of async queues */
+ io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
elv_shutdown_timer_wq(e);
+ /* Wait for iog->key accessors to exit their grace periods. */
+ synchronize_rcu();
BUG_ON(timer_pending(&efqd->idle_slice_timer));
io_free_root_group(e);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
*/
#include <linux/blkdev.h>
+#include <linux/cgroup.h>
#ifndef _BFQ_SCHED_H
#define _BFQ_SCHED_H
#define IO_IOPRIO_CLASSES 3
+#define WEIGHT_MAX 1000
typedef u64 bfq_timestamp_t;
typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
*/
struct io_sched_data {
struct io_entity *active_entity;
+ struct io_entity *next_active;
struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
};
@@ -84,13 +87,12 @@ struct io_sched_data {
* this entity; used for O(log N) lookups into active trees.
* @service: service received during the last round of service.
* @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
* @parent: parent entity, for hierarchical scheduling.
* @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
* associated scheduler queue, %NULL on leaf nodes.
* @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
* @ioprio_class: the ioprio_class in use.
* @new_ioprio_class: when an ioprio_class change is requested, the new
* ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
bfq_timestamp_t min_start;
bfq_service_t service, budget;
- bfq_weight_t weight;
struct io_entity *parent;
struct io_sched_data *my_sched_data;
struct io_sched_data *sched_data;
+ bfq_weight_t weight, new_weight;
unsigned short ioprio, new_ioprio;
unsigned short ioprio_class, new_ioprio_class;
@@ -180,6 +182,75 @@ struct io_queue {
void *sched_queue;
};
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ * both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ * list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ * of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ * the group, one queue per ioprio value per ioprio_class,
+ * except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ * to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ * via RCU from its readers.
+ * o @bfqd is protected by the queue lock, RCU is used to access it
+ * from the readers.
+ * o All the other fields are protected by the @bfqd queue lock.
+ */
+struct io_group {
+ struct io_entity entity;
+ struct hlist_node elv_data_node;
+ struct hlist_node group_node;
+ struct io_sched_data sched_data;
+
+ struct io_entity *my_entity;
+
+ /*
+ * A cgroup has multiple io_groups, one for each request queue.
+ * To find the io group belonging to a particular queue, the elv_fq_data
+ * pointer is stored as a key.
+ */
+ void *key;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+ struct cgroup_subsys_state css;
+
+ unsigned long weight, ioprio_class;
+
+ spinlock_t lock;
+ struct hlist_head group_data;
+};
+#else
struct io_group {
struct io_sched_data sched_data;
@@ -187,10 +258,14 @@ struct io_group {
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
};
+#endif
struct elv_fq_data {
struct io_group *root_group;
+ /* List of io groups hanging on this elevator */
+ struct hlist_head group_list;
+
/* List of io queues on idle tree. */
struct list_head idle_list;
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
ioq->entity.ioprio_changed = 1;
}
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
{
ioq->entity.new_ioprio = ioprio;
+ ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
ioq->entity.ioprio_changed = 1;
}
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
sched_data);
}
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it has been made sure that
+ * both the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence in case of non-hierarchical setup, nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ /* Just root group is present and weight is immaterial. */
+ return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
/* Functions used by blksysfs.c */
extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
{
return NULL;
}
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+ return 1;
+}
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
if (bio_integrity(bio) != blk_integrity_rq(rq))
return 0;
+ /* If rq and bio belong to different groups, don't allow merging */
+ if (!io_group_allow_merge(rq, bio))
+ return 0;
+
if (!elv_iosched_allow_merge(rq, bio))
return 0;
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_set_request_io_group(q, rq);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
#ifdef CONFIG_ELV_FAIR_QUEUING
/* io queue request belongs to */
struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* io group request belongs to */
+ struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
#endif
/* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;
+#ifdef CONFIG_GROUP_IOSCHED
+ /* If task changes the cgroup, elevator processes it asynchronously */
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.
+config GROUP_IOSCHED
+ bool "Group IO Scheduler"
+ depends on CGROUPS && ELV_FAIR_QUEUING
+ default n
+ ---help---
+ This feature lets the IO scheduler recognize task groups and control
+ disk bandwidth allocation to such task groups.
+
endif # CGROUPS
config MM_OWNER
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-07 7:42 ` Gui Jianfeng
2009-05-07 8:05 ` Li Zefan
` (2 more replies)
[not found] ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-08 21:09 ` Andrea Righi
2 siblings, 3 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-07 7:42 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> This patch enables hierarchical fair queuing in common layer. It is
> controlled by config option CONFIG_GROUP_IOSCHED.
...
> +}
> +
> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> + struct hlist_node *n, *tmp;
> + struct io_group *iog;
> +
> + /*
> + * Since we are destroying the cgroup, there are no more tasks
> + * referencing it, and all the RCU grace periods that may have
> + * referenced it are ended (as the destruction of the parent
> + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> + * anything else and we don't need any synchronization.
> + */
> + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> + io_destroy_group(iocg, iog);
> +
> + BUG_ON(!hlist_empty(&iocg->group_data));
> +
Hi Vivek,
IMHO, free_css_id() needs to be called here.
> + kfree(iocg);
> +}
> +
> +void io_disconnect_groups(struct elevator_queue *e)
> +{
> + struct hlist_node *pos, *n;
> + struct io_group *iog;
> + struct elv_fq_data *efqd = &e->efqd;
> +
> + hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
> + elv_data_node) {
> + hlist_del(&iog->elv_data_node);
> +
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-07 7:42 ` Gui Jianfeng
@ 2009-05-07 8:05 ` Li Zefan
[not found] ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-08 12:45 ` Vivek Goyal
2 siblings, 0 replies; 297+ messages in thread
From: Li Zefan @ 2009-05-07 8:05 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Vivek Goyal, nauman, dpshah, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> This patch enables hierarchical fair queuing in common layer. It is
>> controlled by config option CONFIG_GROUP_IOSCHED.
> ...
>> +}
>> +
>> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> +{
>> + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
>> + struct hlist_node *n, *tmp;
>> + struct io_group *iog;
>> +
>> + /*
>> + * Since we are destroying the cgroup, there are no more tasks
>> + * referencing it, and all the RCU grace periods that may have
>> + * referenced it are ended (as the destruction of the parent
>> + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
>> + * anything else and we don't need any synchronization.
>> + */
>> + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
>> + io_destroy_group(iocg, iog);
>> +
>> + BUG_ON(!hlist_empty(&iocg->group_data));
>> +
>
> Hi Vivek,
>
> IMHO, free_css_id() needs to be called here.
>
Right.
Though alloc_css_id() is called by the cgroup core in cgroup_create(),
free_css_id() should be called by the subsystem itself.
This is a bit strange, but it's required by the memory cgroup. Normally,
free_css_id() is called in the destroy() handler, but memcg calls it
when a mem_cgroup's refcnt goes to 0. When a cgroup is destroyed,
the mem_cgroup won't be destroyed (refcnt > 0) if it still has records
on swap entries.
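
For illustration, the destroy handler with this suggestion folded in could
look like the sketch below (untested; it assumes the free_css_id(ss, css)
helper from the css-id patches and otherwise just copies the posted code):

void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
	struct hlist_node *n, *tmp;
	struct io_group *iog;

	/* tear down all the io groups hanging off this cgroup */
	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
		io_destroy_group(iocg, iog);

	BUG_ON(!hlist_empty(&iocg->group_data));

	/* release the css id allocated by the cgroup core at create time */
	free_css_id(&io_subsys, &iocg->css);

	kfree(iocg);
}
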
>> + kfree(iocg);
>> +}
>> +
>> +void io_disconnect_groups(struct elevator_queue *e)
>> +{
>> + struct hlist_node *pos, *n;
>> + struct io_group *iog;
>> + struct elv_fq_data *efqd = &e->efqd;
>> +
>> + hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
>> + elv_data_node) {
>> + hlist_del(&iog->elv_data_node);
>> +
>
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
[not found] ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-07 8:05 ` Li Zefan
2009-05-08 12:45 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Li Zefan @ 2009-05-07 8:05 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> This patch enables hierarchical fair queuing in common layer. It is
>> controlled by config option CONFIG_GROUP_IOSCHED.
> ...
>> +}
>> +
>> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> +{
>> + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
>> + struct hlist_node *n, *tmp;
>> + struct io_group *iog;
>> +
>> + /*
>> + * Since we are destroying the cgroup, there are no more tasks
>> + * referencing it, and all the RCU grace periods that may have
>> + * referenced it are ended (as the destruction of the parent
>> + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
>> + * anything else and we don't need any synchronization.
>> + */
>> + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
>> + io_destroy_group(iocg, iog);
>> +
>> + BUG_ON(!hlist_empty(&iocg->group_data));
>> +
>
> Hi Vivek,
>
> IMHO, free_css_id() needs to be called here.
>
Right.
Though alloc_css_id() is called by the cgroup core in cgroup_create(),
free_css_id() should be called by the subsystem itself.
This is a bit strange, but it's required by the memory cgroup. Normally,
free_css_id() is called in the destroy() handler, but memcg calls it
when a mem_cgroup's refcnt goes to 0. When a cgroup is destroyed,
the mem_cgroup won't be destroyed (refcnt > 0) if it still has records
on swap entries.
>> + kfree(iocg);
>> +}
>> +
>> +void io_disconnect_groups(struct elevator_queue *e)
>> +{
>> + struct hlist_node *pos, *n;
>> + struct io_group *iog;
>> + struct elv_fq_data *efqd = &e->efqd;
>> +
>> + hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
>> + elv_data_node) {
>> + hlist_del(&iog->elv_data_node);
>> +
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
[not found] ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-07 8:05 ` Li Zefan
@ 2009-05-08 12:45 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:45 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Thu, May 07, 2009 at 03:42:37PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > This patch enables hierarchical fair queuing in common layer. It is
> > controlled by config option CONFIG_GROUP_IOSCHED.
> ...
> > +}
> > +
> > +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> > +{
> > + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> > + struct hlist_node *n, *tmp;
> > + struct io_group *iog;
> > +
> > + /*
> > + * Since we are destroying the cgroup, there are no more tasks
> > + * referencing it, and all the RCU grace periods that may have
> > + * referenced it are ended (as the destruction of the parent
> > + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> > + * anything else and we don't need any synchronization.
> > + */
> > + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> > + io_destroy_group(iocg, iog);
> > +
> > + BUG_ON(!hlist_empty(&iocg->group_data));
> > +
>
> Hi Vivek,
>
> IMHO, free_css_id() needs to be called here.
>
Thanks. Sure, will do in the next version.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-07 7:42 ` Gui Jianfeng
2009-05-07 8:05 ` Li Zefan
[not found] ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-08 12:45 ` Vivek Goyal
2 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:45 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Thu, May 07, 2009 at 03:42:37PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > This patch enables hierarchical fair queuing in common layer. It is
> > controlled by config option CONFIG_GROUP_IOSCHED.
> ...
> > +}
> > +
> > +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> > +{
> > + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> > + struct hlist_node *n, *tmp;
> > + struct io_group *iog;
> > +
> > + /*
> > + * Since we are destroying the cgroup, there are no more tasks
> > + * referencing it, and all the RCU grace periods that may have
> > + * referenced it are ended (as the destruction of the parent
> > + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> > + * anything else and we don't need any synchronization.
> > + */
> > + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> > + io_destroy_group(iocg, iog);
> > +
> > + BUG_ON(!hlist_empty(&iocg->group_data));
> > +
>
> Hi Vivek,
>
> IMHO, free_css_id() needs to be called here.
>
Thanks. Sure, will do in the next version.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
[not found] ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-07 7:42 ` Gui Jianfeng
2009-05-08 21:09 ` Andrea Righi
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-07 7:42 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> This patch enables hierarchical fair queuing in common layer. It is
> controlled by config option CONFIG_GROUP_IOSCHED.
...
> +}
> +
> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> + struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> + struct hlist_node *n, *tmp;
> + struct io_group *iog;
> +
> + /*
> + * Since we are destroying the cgroup, there are no more tasks
> + * referencing it, and all the RCU grace periods that may have
> + * referenced it are ended (as the destruction of the parent
> + * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> + * anything else and we don't need any synchronization.
> + */
> + hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> + io_destroy_group(iocg, iog);
> +
> + BUG_ON(!hlist_empty(&iocg->group_data));
> +
Hi Vivek,
IMHO, free_css_id() needs to be called here.
> + kfree(iocg);
> +}
> +
> +void io_disconnect_groups(struct elevator_queue *e)
> +{
> + struct hlist_node *pos, *n;
> + struct io_group *iog;
> + struct elv_fq_data *efqd = &e->efqd;
> +
> + hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
> + elv_data_node) {
> + hlist_del(&iog->elv_data_node);
> +
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
[not found] ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07 7:42 ` Gui Jianfeng
@ 2009-05-08 21:09 ` Andrea Righi
1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 21:09 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> +#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
> +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> + struct cftype *cftype, \
> + u64 val) \
> +{ \
> + struct io_cgroup *iocg; \
> + struct io_group *iog; \
> + struct hlist_node *n; \
> + \
> + if (val < (__MIN) || val > (__MAX)) \
> + return -EINVAL; \
> + \
> + if (!cgroup_lock_live_group(cgroup)) \
> + return -ENODEV; \
> + \
> + iocg = cgroup_to_io_cgroup(cgroup); \
> + \
> + spin_lock_irq(&iocg->lock); \
> + iocg->__VAR = (unsigned long)val; \
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
> + iog->entity.new_##__VAR = (unsigned long)val; \
> + smp_wmb(); \
> + iog->entity.ioprio_changed = 1; \
> + } \
> + spin_unlock_irq(&iocg->lock); \
> + \
> + cgroup_unlock(); \
> + \
> + return 0; \
> +}
> +
> +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
A small fix: io.weight should be strictly greater than 0 if we don't
want to automatically trigger the BUG_ON(entity->weight == 0) in
bfq_calc_finish().
Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
block/elevator-fq.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..de25f44 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
return 0; \
}
-STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-05 19:58 ` Vivek Goyal
2009-05-07 7:42 ` Gui Jianfeng
[not found] ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 21:09 ` Andrea Righi
2009-05-08 21:17 ` Vivek Goyal
2009-05-08 21:17 ` Vivek Goyal
2 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 21:09 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
m-ikeda, akpm
On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> +#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
> +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> + struct cftype *cftype, \
> + u64 val) \
> +{ \
> + struct io_cgroup *iocg; \
> + struct io_group *iog; \
> + struct hlist_node *n; \
> + \
> + if (val < (__MIN) || val > (__MAX)) \
> + return -EINVAL; \
> + \
> + if (!cgroup_lock_live_group(cgroup)) \
> + return -ENODEV; \
> + \
> + iocg = cgroup_to_io_cgroup(cgroup); \
> + \
> + spin_lock_irq(&iocg->lock); \
> + iocg->__VAR = (unsigned long)val; \
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
> + iog->entity.new_##__VAR = (unsigned long)val; \
> + smp_wmb(); \
> + iog->entity.ioprio_changed = 1; \
> + } \
> + spin_unlock_irq(&iocg->lock); \
> + \
> + cgroup_unlock(); \
> + \
> + return 0; \
> +}
> +
> +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
A small fix: io.weight should be strictly greater than 0 if we don't
want to automatically trigger the BUG_ON(entity->weight == 0) in
bfq_calc_finish().
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/elevator-fq.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..de25f44 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
return 0; \
}
-STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-08 21:09 ` Andrea Righi
@ 2009-05-08 21:17 ` Vivek Goyal
2009-05-08 21:17 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:17 UTC (permalink / raw)
To: Andrea Righi
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
m-ikeda, akpm
On Fri, May 08, 2009 at 11:09:37PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> > +#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
> > +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> > + struct cftype *cftype, \
> > + u64 val) \
> > +{ \
> > + struct io_cgroup *iocg; \
> > + struct io_group *iog; \
> > + struct hlist_node *n; \
> > + \
> > + if (val < (__MIN) || val > (__MAX)) \
> > + return -EINVAL; \
> > + \
> > + if (!cgroup_lock_live_group(cgroup)) \
> > + return -ENODEV; \
> > + \
> > + iocg = cgroup_to_io_cgroup(cgroup); \
> > + \
> > + spin_lock_irq(&iocg->lock); \
> > + iocg->__VAR = (unsigned long)val; \
> > + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
> > + iog->entity.new_##__VAR = (unsigned long)val; \
> > + smp_wmb(); \
> > + iog->entity.ioprio_changed = 1; \
> > + } \
> > + spin_unlock_irq(&iocg->lock); \
> > + \
> > + cgroup_unlock(); \
> > + \
> > + return 0; \
> > +}
> > +
> > +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
>
> A small fix: io.weight should be strictly greater than 0 if we don't
> want to automatically trigger the BUG_ON(entity->weight == 0) in
> bfq_calc_finish().
>
> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Thanks Andrea. It worked previously because in the previous version the
knob was io.ioprio, prio 0 was allowed, and we calculated the weights from
the priority.
Will include the fix in the next version.
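For reference, the reason a zero weight is fatal: the finish timestamp of
an entity is its start time plus the service it received divided by its
weight, so weight == 0 would mean a divide by zero. A rough sketch of the
idea (illustrative only, not the actual elevator-fq.c code, which also
scales the service into fixed point):

static inline unsigned long
sketch_calc_finish(unsigned long start, unsigned long service,
		   unsigned long weight)
{
	/* a zero weight would divide by zero here, hence the BUG_ON */
	BUG_ON(weight == 0);
	return start + service / weight;
}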
Thanks
Vivek
> ---
> block/elevator-fq.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..de25f44 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> return 0; \
> }
>
> -STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> +STORE_FUNCTION(weight, 1, WEIGHT_MAX);
> STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
> #undef STORE_FUNCTION
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
2009-05-08 21:09 ` Andrea Righi
2009-05-08 21:17 ` Vivek Goyal
@ 2009-05-08 21:17 ` Vivek Goyal
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:17 UTC (permalink / raw)
To: Andrea Righi
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On Fri, May 08, 2009 at 11:09:37PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> > +#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
> > +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> > + struct cftype *cftype, \
> > + u64 val) \
> > +{ \
> > + struct io_cgroup *iocg; \
> > + struct io_group *iog; \
> > + struct hlist_node *n; \
> > + \
> > + if (val < (__MIN) || val > (__MAX)) \
> > + return -EINVAL; \
> > + \
> > + if (!cgroup_lock_live_group(cgroup)) \
> > + return -ENODEV; \
> > + \
> > + iocg = cgroup_to_io_cgroup(cgroup); \
> > + \
> > + spin_lock_irq(&iocg->lock); \
> > + iocg->__VAR = (unsigned long)val; \
> > + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
> > + iog->entity.new_##__VAR = (unsigned long)val; \
> > + smp_wmb(); \
> > + iog->entity.ioprio_changed = 1; \
> > + } \
> > + spin_unlock_irq(&iocg->lock); \
> > + \
> > + cgroup_unlock(); \
> > + \
> > + return 0; \
> > +}
> > +
> > +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
>
> A small fix: io.weight should be strictly greater than 0 if we don't
> want to automatically trigger the BUG_ON(entity->weight == 0) in
> bfq_calc_finish().
>
> Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Thanks Andrea. It worked previously because in the previous version the
knob was io.ioprio, prio 0 was allowed, and we calculated the weights from
the priority.
Will include the fix in the next version.
Thanks
Vivek
> ---
> block/elevator-fq.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..de25f44 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> return 0; \
> }
>
> -STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> +STORE_FUNCTION(weight, 1, WEIGHT_MAX);
> STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
> #undef STORE_FUNCTION
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH 06/18] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (8 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (27 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
Make cfq hierarchical.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 8 ++++++++
block/cfq-iosched.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
init/Kconfig | 2 +-
3 files changed, 57 insertions(+), 1 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
working environment, suitable for desktop systems.
This is the default I/O scheduler.
+config IOSCHED_CFQ_HIER
+ bool "CFQ Hierarchical Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarhical scheduling in cfq.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f90c534..1e9dd5b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1229,6 +1229,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
ioc->ioprio_changed = 0;
}
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ struct io_group *iog, *__iog;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ iog = io_lookup_io_group_current(q);
+
+ if (async_cfqq != NULL) {
+ __iog = cfqq_to_io_group(async_cfqq);
+
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue(async_cfqq);
+ }
+ }
+
+ if (sync_cfqq != NULL) {
+ __iog = cfqq_to_io_group(sync_cfqq);
+ if (iog != __iog)
+ io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_IOSCHED_CFQ_HIER */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -1494,6 +1538,10 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index ab76477..1a4686d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -607,7 +607,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
size is 4096bytes, 512k per 1Gbytes of swap.
config GROUP_IOSCHED
- bool "Group IO Scheduler"
+ bool
depends on CGROUPS && ELV_FAIR_QUEUING
default n
---help---
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 06/18] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (9 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups Vivek Goyal
` (26 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
Make cfq hierarchical.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 8 ++++++++
block/cfq-iosched.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
init/Kconfig | 2 +-
3 files changed, 57 insertions(+), 1 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
working environment, suitable for desktop systems.
This is the default I/O scheduler.
+config IOSCHED_CFQ_HIER
+ bool "CFQ Hierarchical Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarhical scheduling in cfq.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f90c534..1e9dd5b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1229,6 +1229,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
ioc->ioprio_changed = 0;
}
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ struct io_group *iog, *__iog;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ iog = io_lookup_io_group_current(q);
+
+ if (async_cfqq != NULL) {
+ __iog = cfqq_to_io_group(async_cfqq);
+
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue(async_cfqq);
+ }
+ }
+
+ if (sync_cfqq != NULL) {
+ __iog = cfqq_to_io_group(sync_cfqq);
+ if (iog != __iog)
+ io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_IOSCHED_CFQ_HIER */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -1494,6 +1538,10 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index ab76477..1a4686d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -607,7 +607,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
size is 4096bytes, 512k per 1Gbytes of swap.
config GROUP_IOSCHED
- bool "Group IO Scheduler"
+ bool
depends on CGROUPS && ELV_FAIR_QUEUING
default n
---help---
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (10 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (25 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o This patch exports some statistics through the cgroup interface. Two of
the statistics currently exported are the actual disk time assigned to the
cgroup and the actual number of sectors dispatched to disk on behalf of
this cgroup.
o Currently these numbers are aggregate, that is, they cover all the tasks
in that cgroup across all disks. Later it may also be useful to export
per-disk statistics.
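o As a rough illustration of that per-disk direction (not part of this
patch): the same group walk could feed a keyed cgroup file through the
read_map callback. iog_to_disk_name() below is a hypothetical helper that
would resolve an io_group to the name of the disk it serves; everything
else reuses helpers introduced by this patch.

static int io_cgroup_disk_time_map_read(struct cgroup *cgroup,
					struct cftype *cftype,
					struct cgroup_map_cb *cb)
{
	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
	struct io_group *iog;
	struct hlist_node *n;

	rcu_read_lock();
	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
		/* skip groups waiting to be reclaimed on cgroup deletion */
		if (rcu_dereference(iog->key))
			cb->fill(cb, iog_to_disk_name(iog),
				 jiffies_to_msecs(iog->entity.total_service));
	}
	rcu_read_unlock();

	return 0;
}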
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 7 ++++
2 files changed, 106 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index cdaa46f..b8dbc8b 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -886,13 +886,16 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
return entity;
}
-void entity_served(struct io_entity *entity, bfq_service_t served)
+void entity_served(struct io_entity *entity, bfq_service_t served,
+ bfq_service_t nr_sectors)
{
struct io_service_tree *st;
for_each_entity(entity) {
st = io_entity_service_tree(entity);
entity->service += served;
+ entity->total_service += served;
+ entity->total_sector_service += nr_sectors;
BUG_ON(st->wsum == 0);
st->vtime += bfq_delta(served, st->wsum);
bfq_forget_idle(st);
@@ -1064,6 +1067,92 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
+/*
+ * traverse through all the io_groups associated with this cgroup and calculate
+ * the aggr disk time received by all the groups on respective disks.
+ */
+static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_time = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgoup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_time += iog->entity.total_service;
+ }
+ rcu_read_unlock();
+
+ return disk_time;
+}
+
+static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
+/*
+ * traverse through all the io_groups associated with this cgroup and calculate
+ * the aggr number of sectors transferred by all the groups on respective disks.
+ */
+static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_sectors = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgoup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_sectors += iog->entity.total_sector_service;
+ }
+ rcu_read_unlock();
+
+ return disk_sectors;
+}
+
+static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = calculate_aggr_disk_sectors(iocg);
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
/**
* bfq_group_chain_alloc - allocate a chain of groups.
* @bfqd: queue descriptor.
@@ -1297,6 +1386,14 @@ struct cftype bfqio_files[] = {
.read_u64 = io_cgroup_ioprio_class_read,
.write_u64 = io_cgroup_ioprio_class_write,
},
+ {
+ .name = "disk_time",
+ .read_u64 = io_cgroup_disk_time_read,
+ },
+ {
+ .name = "disk_sectors",
+ .read_u64 = io_cgroup_disk_sectors_read,
+ },
};
int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1712,7 +1809,7 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
- entity_served(&ioq->entity, served);
+ entity_served(&ioq->entity, served, ioq->nr_sectors);
}
/* Tells whether ioq is queued in root group or not */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8c60cf7..f4c6361 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,6 +145,13 @@ struct io_entity {
unsigned short ioprio_class, new_ioprio_class;
int ioprio_changed;
+
+ /*
+ * Keep track of total service received by this entity. Keep the
+ * stats both for time slices and number of sectors dispatched
+ */
+ unsigned long total_service;
+ unsigned long total_sector_service;
};
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (11 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-13 2:39 ` Gui Jianfeng
[not found] ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-05 19:58 ` [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it Vivek Goyal
` (24 subsequent siblings)
37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o This patch exports some statistics through the cgroup interface. Two of
the statistics currently exported are the actual disk time assigned to the
cgroup and the actual number of sectors dispatched to disk on behalf of
this cgroup.
o Currently these numbers are aggregate, that is, they cover all the tasks
in that cgroup across all disks. Later it may also be useful to export
per-disk statistics.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 7 ++++
2 files changed, 106 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index cdaa46f..b8dbc8b 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -886,13 +886,16 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
return entity;
}
-void entity_served(struct io_entity *entity, bfq_service_t served)
+void entity_served(struct io_entity *entity, bfq_service_t served,
+ bfq_service_t nr_sectors)
{
struct io_service_tree *st;
for_each_entity(entity) {
st = io_entity_service_tree(entity);
entity->service += served;
+ entity->total_service += served;
+ entity->total_sector_service += nr_sectors;
BUG_ON(st->wsum == 0);
st->vtime += bfq_delta(served, st->wsum);
bfq_forget_idle(st);
@@ -1064,6 +1067,92 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
+/*
+ * traverse through all the io_groups associated with this cgroup and calculate
+ * the aggr disk time received by all the groups on respective disks.
+ */
+static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_time = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgoup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_time += iog->entity.total_service;
+ }
+ rcu_read_unlock();
+
+ return disk_time;
+}
+
+static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
+/*
+ * traverse through all the io_groups associated with this cgroup and calculate
+ * the aggr number of sectors transferred by all the groups on respective disks.
+ */
+static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_sectors = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgoup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_sectors += iog->entity.total_sector_service;
+ }
+ rcu_read_unlock();
+
+ return disk_sectors;
+}
+
+static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = calculate_aggr_disk_sectors(iocg);
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
/**
* bfq_group_chain_alloc - allocate a chain of groups.
* @bfqd: queue descriptor.
@@ -1297,6 +1386,14 @@ struct cftype bfqio_files[] = {
.read_u64 = io_cgroup_ioprio_class_read,
.write_u64 = io_cgroup_ioprio_class_write,
},
+ {
+ .name = "disk_time",
+ .read_u64 = io_cgroup_disk_time_read,
+ },
+ {
+ .name = "disk_sectors",
+ .read_u64 = io_cgroup_disk_sectors_read,
+ },
};
int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1712,7 +1809,7 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
- entity_served(&ioq->entity, served);
+ entity_served(&ioq->entity, served, ioq->nr_sectors);
}
/* Tells whether ioq is queued in root group or not */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8c60cf7..f4c6361 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,6 +145,13 @@ struct io_entity {
unsigned short ioprio_class, new_ioprio_class;
int ioprio_changed;
+
+ /*
+ * Keep track of total service received by this entity. Keep the
+ * stats both for time slices and number of sectors dispatched
+ */
+ unsigned long total_service;
+ unsigned long total_sector_service;
};
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-13 2:39 ` Gui Jianfeng
[not found] ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 14:51 ` Vivek Goyal
[not found] ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13 2:39 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
>
> +/*
> + * traverse through all the io_groups associated with this cgroup and calculate
> + * the aggr disk time received by all the groups on respective disks.
> + */
> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +{
> + struct io_group *iog;
> + struct hlist_node *n;
> + u64 disk_time = 0;
> +
> + rcu_read_lock();
This function is in the slow path, so there is no need to call
rcu_read_lock(); we just need to ensure that the caller already holds
iocg->lock.
> + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> + /*
> + * There might be groups which are not functional and
> + * waiting to be reclaimed upon cgoup deletion.
> + */
> + if (rcu_dereference(iog->key))
> + disk_time += iog->entity.total_service;
> + }
> + rcu_read_unlock();
> +
> + return disk_time;
> +}
> +
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
[not found] ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:51 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:51 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >
> > +/*
> > + * traverse through all the io_groups associated with this cgroup and calculate
> > + * the aggr disk time received by all the groups on respective disks.
> > + */
> > +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> > +{
> > + struct io_group *iog;
> > + struct hlist_node *n;
> > + u64 disk_time = 0;
> > +
> > + rcu_read_lock();
>
> This function is in slow-path, so no need to call rcu_read_lock(), just need to ensure
> that the caller already holds the iocg->lock.
>
Or can we get rid of the requirement for iocg->lock here and just read the
io group data under the RCU read lock? Actually I am wondering why we
require iocg->lock here at all. We are not modifying the RCU-protected
list; we are just traversing it and reading the data.
Thanks
Vivek
> > + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> > + /*
> > + * There might be groups which are not functional and
> > + * waiting to be reclaimed upon cgoup deletion.
> > + */
> > + if (rcu_dereference(iog->key))
> > + disk_time += iog->entity.total_service;
> > + }
> > + rcu_read_unlock();
> > +
> > + return disk_time;
> > +}
> > +
>
> --
> Regards
> Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
2009-05-13 2:39 ` Gui Jianfeng
[not found] ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:51 ` Vivek Goyal
2009-05-14 7:53 ` Gui Jianfeng
[not found] ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:51 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >
> > +/*
> > + * traverse through all the io_groups associated with this cgroup and calculate
> > + * the aggr disk time received by all the groups on respective disks.
> > + */
> > +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> > +{
> > + struct io_group *iog;
> > + struct hlist_node *n;
> > + u64 disk_time = 0;
> > +
> > + rcu_read_lock();
>
> This function is in slow-path, so no need to call rcu_read_lock(), just need to ensure
> that the caller already holds the iocg->lock.
>
Or can we get rid of the requirement for iocg->lock here and just read the
io group data under the RCU read lock? Actually I am wondering why we
require iocg->lock here at all. We are not modifying the RCU-protected
list; we are just traversing it and reading the data.
Thanks
Vivek
> > + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> > + /*
> > + * There might be groups which are not functional and
> > + * waiting to be reclaimed upon cgoup deletion.
> > + */
> > + if (rcu_dereference(iog->key))
> > + disk_time += iog->entity.total_service;
> > + }
> > + rcu_read_unlock();
> > +
> > + return disk_time;
> > +}
> > +
>
> --
> Regards
> Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
2009-05-13 14:51 ` Vivek Goyal
@ 2009-05-14 7:53 ` Gui Jianfeng
[not found] ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 7:53 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>
>>> +/*
>>> + * traverse through all the io_groups associated with this cgroup and calculate
>>> + * the aggr disk time received by all the groups on respective disks.
>>> + */
>>> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
>>> +{
>>> + struct io_group *iog;
>>> + struct hlist_node *n;
>>> + u64 disk_time = 0;
>>> +
>>> + rcu_read_lock();
>> This function is in slow-path, so no need to call rcu_read_lock(), just need to ensure
>> that the caller already holds the iocg->lock.
>>
>
> Or can we get rid of requirement of iocg_lock here and just read the io
> group data under rcu read lock? Actually I am wondering why do we require
> an iocg_lock here. We are not modifying the rcu protected list. We are
> just traversing through it and reading the data.
Yes, I think removing the iocg->lock from the caller
(io_cgroup_disk_time_read()) is the better choice.
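For example, the read side could then look like this (untested sketch; it
simply drops iocg->lock and relies on the rcu_read_lock() taken inside the
helper):

static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
					struct cftype *cftype)
{
	struct io_cgroup *iocg;
	u64 ret;

	if (!cgroup_lock_live_group(cgroup))
		return -ENODEV;

	iocg = cgroup_to_io_cgroup(cgroup);
	/* calculate_aggr_disk_time() takes rcu_read_lock() itself */
	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));

	cgroup_unlock();

	return ret;
}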
>
> Thanks
> Vivek
>
>>> + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
>>> + /*
>>> + * There might be groups which are not functional and
>>> + * waiting to be reclaimed upon cgoup deletion.
>>> + */
>>> + if (rcu_dereference(iog->key))
>>> + disk_time += iog->entity.total_service;
>>> + }
>>> + rcu_read_unlock();
>>> +
>>> + return disk_time;
>>> +}
>>> +
>> --
>> Regards
>> Gui Jianfeng
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
[not found] ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 7:53 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 7:53 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>
>>> +/*
>>> + * traverse through all the io_groups associated with this cgroup and calculate
>>> + * the aggr disk time received by all the groups on respective disks.
>>> + */
>>> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
>>> +{
>>> + struct io_group *iog;
>>> + struct hlist_node *n;
>>> + u64 disk_time = 0;
>>> +
>>> + rcu_read_lock();
>> This function is in slow-path, so no need to call rcu_read_lock(), just need to ensure
>> that the caller already holds the iocg->lock.
>>
>
> Or can we get rid of requirement of iocg_lock here and just read the io
> group data under rcu read lock? Actually I am wondering why do we require
> an iocg_lock here. We are not modifying the rcu protected list. We are
> just traversing through it and reading the data.
Yes, I think removing the iocg->lock from the caller
(io_cgroup_disk_time_read()) is the better choice.
>
> Thanks
> Vivek
>
>>> + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
>>> + /*
>>> + * There might be groups which are not functional and
>>> + * waiting to be reclaimed upon cgoup deletion.
>>> + */
>>> + if (rcu_dereference(iog->key))
>>> + disk_time += iog->entity.total_service;
>>> + }
>>> + rcu_read_unlock();
>>> +
>>> + return disk_time;
>>> +}
>>> +
>> --
>> Regards
>> Gui Jianfeng
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
[not found] ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-13 2:39 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13 2:39 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
...
>
> +/*
> + * traverse through all the io_groups associated with this cgroup and calculate
> + * the aggr disk time received by all the groups on respective disks.
> + */
> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +{
> + struct io_group *iog;
> + struct hlist_node *n;
> + u64 disk_time = 0;
> +
> + rcu_read_lock();
This function is in the slow path, so there is no need to call
rcu_read_lock(); we just need to ensure that the caller already holds
iocg->lock.
> + hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> + /*
> + * There might be groups which are not functional and
> + * waiting to be reclaimed upon cgoup deletion.
> + */
> + if (rcu_dereference(iog->key))
> + disk_time += iog->entity.total_service;
> + }
> + rcu_read_unlock();
> +
> + return disk_time;
> +}
> +
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (12 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (23 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o When a sync queue expires, in many cases it might be empty and will then
be deleted from the active tree. This leads to a scenario where, out of two
competing queues, only one is on the tree; when a new queue is selected a
vtime jump takes place and we don't see service provided in proportion to
weight.
o In general this is a fundamental problem with fairness for sync queues
that are not continuously backlogged. Idling looks like the only solution
to make sure such queues can get a decent amount of disk bandwidth in the
face of competition from continuously backlogged queues. But excessive
idling has the potential to reduce performance on SSDs and disks with
command queuing.
o This patch experiments with waiting for the next request to arrive before
a queue is expired after it has consumed its time slice. This can ensure
more accurate fairness numbers in some cases.
o Introduced a tunable "fairness". If set, io-controller will put more
focus on getting fairness right than getting throughput right.
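o In condensed form, the decision added on the request-completion path is
roughly the following (a sketch of the logic, not the literal diff below):

static void sketch_completed_request(struct request_queue *q,
				     struct io_queue *ioq, int sync)
{
	struct elv_fq_data *efqd = &q->elevator->efqd;

	if (efqd->fairness && sync && !ioq->nr_queued) {
		/*
		 * The queue just went empty. Instead of expiring it (and
		 * risking a vtime jump), idle and wait for it to become
		 * busy again; wait_for_busy is set if its slice is used up.
		 */
		elv_ioq_arm_slice_timer(q, elv_ioq_slice_used(ioq));
		return;
	}

	/* otherwise fall back to the existing expiry/idling decisions */
	if (elv_ioq_slice_used(ioq))
		elv_ioq_slice_expired(q);
}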
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-sysfs.c | 7 +++
block/elevator-fq.c | 117 +++++++++++++++++++++++++++++++++++++++++++++-----
block/elevator-fq.h | 12 +++++
3 files changed, 124 insertions(+), 12 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 082a273..c942ddc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -294,6 +294,12 @@ static struct queue_sysfs_entry queue_slice_async_entry = {
.show = elv_slice_async_show,
.store = elv_slice_async_store,
};
+
+static struct queue_sysfs_entry queue_fairness_entry = {
+ .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_fairness_show,
+ .store = elv_fairness_store,
+};
#endif
static struct attribute *default_attrs[] = {
@@ -311,6 +317,7 @@ static struct attribute *default_attrs[] = {
&queue_slice_idle_entry.attr,
&queue_slice_sync_entry.attr,
&queue_slice_async_entry.attr,
+ &queue_fairness_entry.attr,
#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b8dbc8b..ec01273 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1821,6 +1821,44 @@ static inline int is_root_group_ioq(struct request_queue *q,
return (ioq->entity.sched_data == &efqd->root_group->sched_data);
}
+/* Functions to show and store fairness value through sysfs */
+ssize_t elv_fairness_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->fairness;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->fairness = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
/* Functions to show and store elv_idle_slice value through sysfs */
ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
{
@@ -2061,7 +2099,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
* io scheduler if it wants to disable idling based on additional
* considrations like seek pattern.
*/
- if (enable_idle) {
+ if (enable_idle && !efqd->fairness) {
if (eq->ops->elevator_update_idle_window_fn)
enable_idle = eq->ops->elevator_update_idle_window_fn(
eq, ioq->sched_queue, rq);
@@ -2395,10 +2433,11 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
assert_spin_locked(q->queue_lock);
elv_log_ioq(efqd, ioq, "slice expired");
- if (elv_ioq_wait_request(ioq))
+ if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
del_timer(&efqd->idle_slice_timer);
elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_wait_busy(ioq);
/*
* if ioq->slice_end = 0, that means a queue was expired before first
@@ -2563,7 +2602,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
* has other work pending, don't risk delaying until the
* idle timer unplug to continue working.
*/
- if (elv_ioq_wait_request(ioq)) {
+ if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
efqd->busy_queues > 1) {
del_timer(&efqd->idle_slice_timer);
@@ -2571,6 +2610,17 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
}
elv_mark_ioq_must_dispatch(ioq);
}
+
+ /*
+ * If we were waiting for a request on this queue, wait is
+ * done. Schedule the next dispatch
+ */
+ if (elv_ioq_wait_busy(ioq)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_ioq_wait_busy(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_schedule_dispatch(q);
+ }
} else if (elv_should_preempt(q, ioq, rq)) {
/*
* not the active queue - expire current slice if it is
@@ -2598,6 +2648,9 @@ void elv_idle_slice_timer(unsigned long data)
if (ioq) {
+ if (elv_ioq_wait_busy(ioq))
+ goto expire;
+
/*
* We saw a request before the queue expired, let it through
*/
@@ -2631,7 +2684,7 @@ out_cont:
spin_unlock_irqrestore(q->queue_lock, flags);
}
-void elv_ioq_arm_slice_timer(struct request_queue *q)
+void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2644,26 +2697,38 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk_queue_nonrot(q) && efqd->hw_tag)
+ if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
return;
/*
- * still requests with the driver, don't idle
+ * idle is disabled, either manually or by past process history
*/
- if (efqd->rq_in_driver)
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
return;
/*
- * idle is disabled, either manually or by past process history
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select next queue for dispatch.
*/
- if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ if (efqd->fairness && wait_for_busy && !ioq->dispatched) {
+ elv_mark_ioq_wait_busy(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu wait busy=1", sl);
+ return;
+ }
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver && !efqd->fairness)
return;
/*
* may be iosched got its own idling logic. In that case io
* schduler will take care of arming the timer, if need be.
*/
- if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
q->elevator->ops->elevator_arm_slice_timer_fn(q,
ioq->sched_queue);
} else {
@@ -2706,6 +2771,12 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto expire;
}
+ /* We are waiting for this queue to become busy before it expires.*/
+ if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -2915,6 +2986,25 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_ioq_set_prio_slice(q, ioq);
elv_clear_ioq_slice_new(ioq);
}
+
+ if (elv_ioq_class_idle(ioq)) {
+ elv_ioq_slice_expired(q);
+ goto done;
+ }
+
+ if (efqd->fairness && sync && !ioq->nr_queued) {
+ /*
+ * If fairness is enabled, wait for one extra idle
+ * period in the hope that this queue will get
+ * backlogged again
+ */
+ if (elv_ioq_slice_used(ioq))
+ elv_ioq_arm_slice_timer(q, 1);
+ else
+ elv_ioq_arm_slice_timer(q, 0);
+ goto done;
+ }
+
/*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
@@ -2922,13 +3012,14 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ if (elv_ioq_slice_used(ioq))
elv_ioq_slice_expired(q);
else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
- elv_ioq_arm_slice_timer(q);
+ elv_ioq_arm_slice_timer(q, 0);
}
+done:
if (!efqd->rq_in_driver)
elv_schedule_dispatch(q);
}
@@ -3035,6 +3126,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
efqd->elv_slice_idle = elv_slice_idle;
efqd->hw_tag = 1;
+ /* For the time being keep fairness enabled by default */
+ efqd->fairness = 1;
return 0;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f4c6361..7d3434b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -316,6 +316,13 @@ struct elv_fq_data {
unsigned long long rate_sampling_start; /*sampling window start jifies*/
/* number of sectors finished io during current sampling window */
unsigned long rate_sectors_current;
+
+ /*
+ * If set to 1, will disable many optimizations done for boost
+ * throughput and focus more on providing fairness for sync
+ * queues.
+ */
+ int fairness;
};
extern int elv_slice_idle;
@@ -340,6 +347,7 @@ enum elv_queue_state_flags {
ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_wait_busy, /* wait for this queue to get busy */
ELV_QUEUE_FLAG_NR,
};
@@ -363,6 +371,7 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
static inline struct io_service_tree *
io_entity_service_tree(struct io_entity *entity)
@@ -541,6 +550,9 @@ extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
size_t count);
+extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count);
/* Functions used by elevator.c */
extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (13 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-13 15:00 ` Vivek Goyal
` (3 more replies)
2009-05-05 19:58 ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
` (22 subsequent siblings)
37 siblings, 4 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o When a sync queue expires, in many cases it might be empty and then
it will be deleted from the active tree. This leads to a scenario
where, out of two competing queues, only one is on the tree and when a
new queue is selected, a vtime jump takes place and we don't see service
provided in proportion to weight.
o In general this is a fundamental problem with fairness for sync queues
which are not continuously backlogged. Idling looks like the only
solution to make sure such queues can get some decent amount
of disk bandwidth in the face of competition from continuously backlogged
queues. But excessive idling has the potential to reduce performance on
SSDs and disks with command queuing.
o This patch experiments with waiting for the next request to arrive
before a queue is expired after it has consumed its time slice. This can
ensure more accurate fairness numbers in some cases.
o Introduced a tunable "fairness". If set, the io-controller will put
more focus on getting fairness right than on getting throughput right.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-sysfs.c | 7 +++
block/elevator-fq.c | 117 +++++++++++++++++++++++++++++++++++++++++++++-----
block/elevator-fq.h | 12 +++++
3 files changed, 124 insertions(+), 12 deletions(-)
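In short, the completion-path change boils down to the sketch below. This is
only an illustration (the function name is made up; the helpers are the ones
this patch touches), not the literal code.

/*
 * Illustrative sketch only, not part of the patch. On completion of the
 * last request of a sync queue, with the "fairness" tunable set, arm the
 * idle timer and wait for the queue to get busy again instead of
 * expiring it right away.
 */
static void fairness_completion_sketch(struct request_queue *q,
					struct io_queue *ioq, int sync)
{
	struct elv_fq_data *efqd = &q->elevator->efqd;

	if (efqd->fairness && sync && !ioq->nr_queued) {
		/* pass wait_for_busy=1 once the slice has been used up */
		elv_ioq_arm_slice_timer(q, elv_ioq_slice_used(ioq) ? 1 : 0);
		return;
	}

	if (elv_ioq_slice_used(ioq))
		elv_ioq_slice_expired(q);
}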
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 082a273..c942ddc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -294,6 +294,12 @@ static struct queue_sysfs_entry queue_slice_async_entry = {
.show = elv_slice_async_show,
.store = elv_slice_async_store,
};
+
+static struct queue_sysfs_entry queue_fairness_entry = {
+ .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_fairness_show,
+ .store = elv_fairness_store,
+};
#endif
static struct attribute *default_attrs[] = {
@@ -311,6 +317,7 @@ static struct attribute *default_attrs[] = {
&queue_slice_idle_entry.attr,
&queue_slice_sync_entry.attr,
&queue_slice_async_entry.attr,
+ &queue_fairness_entry.attr,
#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b8dbc8b..ec01273 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1821,6 +1821,44 @@ static inline int is_root_group_ioq(struct request_queue *q,
return (ioq->entity.sched_data == &efqd->root_group->sched_data);
}
+/* Functions to show and store fairness value through sysfs */
+ssize_t elv_fairness_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->fairness;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->fairness = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
/* Functions to show and store elv_idle_slice value through sysfs */
ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
{
@@ -2061,7 +2099,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
* io scheduler if it wants to disable idling based on additional
* considrations like seek pattern.
*/
- if (enable_idle) {
+ if (enable_idle && !efqd->fairness) {
if (eq->ops->elevator_update_idle_window_fn)
enable_idle = eq->ops->elevator_update_idle_window_fn(
eq, ioq->sched_queue, rq);
@@ -2395,10 +2433,11 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
assert_spin_locked(q->queue_lock);
elv_log_ioq(efqd, ioq, "slice expired");
- if (elv_ioq_wait_request(ioq))
+ if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
del_timer(&efqd->idle_slice_timer);
elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_wait_busy(ioq);
/*
* if ioq->slice_end = 0, that means a queue was expired before first
@@ -2563,7 +2602,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
* has other work pending, don't risk delaying until the
* idle timer unplug to continue working.
*/
- if (elv_ioq_wait_request(ioq)) {
+ if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
efqd->busy_queues > 1) {
del_timer(&efqd->idle_slice_timer);
@@ -2571,6 +2610,17 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
}
elv_mark_ioq_must_dispatch(ioq);
}
+
+ /*
+ * If we were waiting for a request on this queue, wait is
+ * done. Schedule the next dispatch
+ */
+ if (elv_ioq_wait_busy(ioq)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_ioq_wait_busy(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_schedule_dispatch(q);
+ }
} else if (elv_should_preempt(q, ioq, rq)) {
/*
* not the active queue - expire current slice if it is
@@ -2598,6 +2648,9 @@ void elv_idle_slice_timer(unsigned long data)
if (ioq) {
+ if (elv_ioq_wait_busy(ioq))
+ goto expire;
+
/*
* We saw a request before the queue expired, let it through
*/
@@ -2631,7 +2684,7 @@ out_cont:
spin_unlock_irqrestore(q->queue_lock, flags);
}
-void elv_ioq_arm_slice_timer(struct request_queue *q)
+void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2644,26 +2697,38 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk_queue_nonrot(q) && efqd->hw_tag)
+ if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
return;
/*
- * still requests with the driver, don't idle
+ * idle is disabled, either manually or by past process history
*/
- if (efqd->rq_in_driver)
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
return;
/*
- * idle is disabled, either manually or by past process history
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select next queue for dispatch.
*/
- if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ if (efqd->fairness && wait_for_busy && !ioq->dispatched) {
+ elv_mark_ioq_wait_busy(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu wait busy=1", sl);
+ return;
+ }
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver && !efqd->fairness)
return;
/*
* may be iosched got its own idling logic. In that case io
* schduler will take care of arming the timer, if need be.
*/
- if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
q->elevator->ops->elevator_arm_slice_timer_fn(q,
ioq->sched_queue);
} else {
@@ -2706,6 +2771,12 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto expire;
}
+ /* We are waiting for this queue to become busy before it expires.*/
+ if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -2915,6 +2986,25 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_ioq_set_prio_slice(q, ioq);
elv_clear_ioq_slice_new(ioq);
}
+
+ if (elv_ioq_class_idle(ioq)) {
+ elv_ioq_slice_expired(q);
+ goto done;
+ }
+
+ if (efqd->fairness && sync && !ioq->nr_queued) {
+ /*
+ * If fairness is enabled, wait for one extra idle
+ * period in the hope that this queue will get
+ * backlogged again
+ */
+ if (elv_ioq_slice_used(ioq))
+ elv_ioq_arm_slice_timer(q, 1);
+ else
+ elv_ioq_arm_slice_timer(q, 0);
+ goto done;
+ }
+
/*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
@@ -2922,13 +3012,14 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ if (elv_ioq_slice_used(ioq))
elv_ioq_slice_expired(q);
else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
- elv_ioq_arm_slice_timer(q);
+ elv_ioq_arm_slice_timer(q, 0);
}
+done:
if (!efqd->rq_in_driver)
elv_schedule_dispatch(q);
}
@@ -3035,6 +3126,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
efqd->elv_slice_idle = elv_slice_idle;
efqd->hw_tag = 1;
+ /* For the time being keep fairness enabled by default */
+ efqd->fairness = 1;
return 0;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f4c6361..7d3434b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -316,6 +316,13 @@ struct elv_fq_data {
unsigned long long rate_sampling_start; /*sampling window start jifies*/
/* number of sectors finished io during current sampling window */
unsigned long rate_sectors_current;
+
+ /*
+ * If set to 1, will disable many optimizations done for boost
+ * throughput and focus more on providing fairness for sync
+ * queues.
+ */
+ int fairness;
};
extern int elv_slice_idle;
@@ -340,6 +347,7 @@ enum elv_queue_state_flags {
ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_wait_busy, /* wait for this queue to get busy */
ELV_QUEUE_FLAG_NR,
};
@@ -363,6 +371,7 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
static inline struct io_service_tree *
io_entity_service_tree(struct io_entity *entity)
@@ -541,6 +550,9 @@ extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
size_t count);
+extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count);
/* Functions used by elevator.c */
extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-13 15:00 ` Vivek Goyal
2009-05-13 15:00 ` Vivek Goyal
` (2 subsequent siblings)
3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:00 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm
On Tue, May 05, 2009 at 03:58:35PM -0400, Vivek Goyal wrote:
> o When a sync queue expires, in many cases it might be empty and then
> it will be deleted from the active tree. This leads to a scenario
> where, out of two competing queues, only one is on the tree and when a
> new queue is selected, a vtime jump takes place and we don't see service
> provided in proportion to weight.
>
> o In general this is a fundamental problem with fairness for sync queues
> which are not continuously backlogged. Idling looks like the only
> solution to make sure such queues can get some decent amount
> of disk bandwidth in the face of competition from continuously backlogged
> queues. But excessive idling has the potential to reduce performance on
> SSDs and disks with command queuing.
>
> o This patch experiments with waiting for the next request to arrive
> before a queue is expired after it has consumed its time slice. This can
> ensure more accurate fairness numbers in some cases.
>
> o Introduced a tunable "fairness". If set, the io-controller will put
> more focus on getting fairness right than on getting throughput right.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
The following is a fix which should go here. This patch helps me get much
better fairness numbers for sync queues.
o Fix a window where a queue can be expired without doing a busy wait for
the next request. This fix allows better fairness numbers for sync queues.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c 2009-05-13 10:55:44.000000000 -0400
+++ linux14/block/elevator-fq.c 2009-05-13 10:55:50.000000000 -0400
@@ -3368,8 +3368,22 @@ void *elv_fq_select_ioq(struct request_q
/*
* The active queue has run out of time, expire it and select new.
*/
- if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
- goto expire;
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+ /*
+ * Queue has used up its slice. Wait busy is not on otherwise
+ * we wouldn't have been here. There is a chance that after
+ * slice expiry no request from the queue completed hence
+ * wait busy timer could not be turned on. If that's the case
+ * don't expire the queue yet. Next request completion from
+ * the queue will arm the wait busy timer.
+ */
+ if (efqd->fairness && !ioq->nr_queued
+ && elv_ioq_nr_dispatched(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ } else
+ goto expire;
+ }
/*
* If we have a RT cfqq waiting, then we pre-empt the current non-rt
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
2009-05-05 19:58 ` Vivek Goyal
2009-05-13 15:00 ` Vivek Goyal
2009-05-13 15:00 ` Vivek Goyal
@ 2009-06-09 7:56 ` Gui Jianfeng
2009-06-09 17:51 ` Vivek Goyal
[not found] ` <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
[not found] ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
3 siblings, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-09 7:56 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> + size_t count)
> +{
> + struct elv_fq_data *efqd;
> + unsigned int data;
> + unsigned long flags;
> +
> + char *p = (char *)name;
> +
> + data = simple_strtoul(p, &p, 10);
> +
> + if (data < 0)
> + data = 0;
> + else if (data > INT_MAX)
> + data = INT_MAX;
Hi Vivek,
data might overflow on 64 bit systems. In addition, since "fairness" is nothing
more than an on/off switch, just treat it as one.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/elevator-fq.c | 10 +++++-----
block/elevator-fq.h | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)
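To see the issue concretely, here is a tiny userspace demo (hypothetical, not
kernel code) of the truncation: strtoul(), like simple_strtoul(), returns an
unsigned long, so on a 64 bit system assigning the result to an unsigned int
silently drops the upper bits before the INT_MAX clamp ever runs.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	unsigned int data;
	char *end;

	/* 2^32 parses fine as an unsigned long but truncates to 0 here */
	data = strtoul("4294967296", &end, 10);
	printf("%u\n", data);	/* prints 0 on an LP64 system */
	return 0;
}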
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 655162b..42d4279 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2605,7 +2605,7 @@ static inline int is_root_group_ioq(struct request_queue *q,
ssize_t elv_fairness_show(struct request_queue *q, char *name)
{
struct elv_fq_data *efqd;
- unsigned int data;
+ unsigned long data;
unsigned long flags;
spin_lock_irqsave(q->queue_lock, flags);
@@ -2619,17 +2619,17 @@ ssize_t elv_fairness_store(struct request_queue *q, const char *name,
size_t count)
{
struct elv_fq_data *efqd;
- unsigned int data;
+ unsigned long data;
unsigned long flags;
char *p = (char *)name;
data = simple_strtoul(p, &p, 10);
- if (data < 0)
+ if (!data)
data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
+ else
+ data = 1;
spin_lock_irqsave(q->queue_lock, flags);
efqd = &q->elevator->efqd;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b2bb11a..4fe843a 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -359,7 +359,7 @@ struct elv_fq_data {
* throughput and focus more on providing fairness for sync
* queues.
*/
- int fairness;
+ unsigned long fairness;
};
extern int elv_slice_idle;
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
2009-06-09 7:56 ` Gui Jianfeng
@ 2009-06-09 17:51 ` Vivek Goyal
[not found] ` <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-06-09 17:51 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> > + size_t count)
> > +{
> > + struct elv_fq_data *efqd;
> > + unsigned int data;
> > + unsigned long flags;
> > +
> > + char *p = (char *)name;
> > +
> > + data = simple_strtoul(p, &p, 10);
> > +
> > + if (data < 0)
> > + data = 0;
> > + else if (data > INT_MAX)
> > + data = INT_MAX;
>
> Hi Vivek,
>
> data might overflow on 64 bit systems. In addition, since "fairness" is nothing
> more than an on/off switch, just treat it as one.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
Hi Gui,
How about the following patch? Currently it should apply at the end of the
patch series. If it looks good, I will merge the changes into the higher
level patches.
Thanks
Vivek
o Previously the common layer elevator parameters were appearing as request
queue parameters in sysfs. But actually these are io scheduler parameters
in hierarchical mode. Fix it.
o Use macros to define the multiple sysfs C functions that do the same thing.
Code borrowed from CFQ. Helps reduce the number of lines of code by about 140.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/as-iosched.c | 5
block/blk-sysfs.c | 39 -------
block/cfq-iosched.c | 5
block/deadline-iosched.c | 5
block/elevator-fq.c | 245 +++++++++++------------------------------------
block/elevator-fq.h | 26 ++--
block/noop-iosched.c | 10 +
7 files changed, 97 insertions(+), 238 deletions(-)
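To make the macro pattern concrete, SHOW_FUNCTION(elv_fairness_show,
efqd->fairness, 0) below expands to roughly the following (hand-expanded here
for illustration; the STORE_FUNCTION side is analogous, clamping the parsed
value between MIN and MAX before storing it):

ssize_t elv_fairness_show(struct elevator_queue *e, char *page)
{
	struct elv_fq_data *efqd = &e->efqd;
	unsigned int __data = efqd->fairness;

	/* __CONV is 0 for "fairness", so no jiffies_to_msecs() conversion */
	return elv_var_show(__data, page);
}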
Index: linux18/block/elevator-fq.h
===================================================================
--- linux18.orig/block/elevator-fq.h 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.h 2009-06-09 13:35:03.000000000 -0400
@@ -27,6 +27,9 @@ struct io_queue;
#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
/**
* struct bfq_service_tree - per ioprio_class service tree.
* @active: tree for active entities (i.e., those backlogged).
@@ -364,7 +367,7 @@ struct elv_fq_data {
* throughput and focus more on providing fairness for sync
* queues.
*/
- int fairness;
+ unsigned int fairness;
int only_root_group;
};
@@ -650,23 +653,22 @@ static inline struct io_queue *elv_looku
#endif /* GROUP_IOSCHED */
-/* Functions used by blksysfs.c */
-extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_sync_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_async_slice_idle_store(struct request_queue *q,
+extern ssize_t elv_async_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_async_slice_idle_store(struct elevator_queue *e,
const char *name, size_t count);
-extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_async_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
-extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+extern ssize_t elv_fairness_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *e, const char *name,
size_t count);
/* Functions used by elevator.c */
Index: linux18/block/elevator-fq.c
===================================================================
--- linux18.orig/block/elevator-fq.c 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.c 2009-06-09 13:39:48.000000000 -0400
@@ -2618,201 +2618,72 @@ static inline int is_root_group_ioq(stru
return (ioq->entity.sched_data == &efqd->root_group->sched_data);
}
-/* Functions to show and store fairness value through sysfs */
-ssize_t elv_fairness_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->fairness;
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_fairness_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->fairness = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = jiffies_to_msecs(efqd->elv_slice_idle);
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- data = msecs_to_jiffies(data);
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice_idle = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = jiffies_to_msecs(efqd->elv_async_slice_idle);
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_async_slice_idle_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- data = msecs_to_jiffies(data);
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_async_slice_idle = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_slice_sync value through sysfs */
-ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->elv_slice[1];
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
+ return sprintf(page, "%d\n", var);
}
-ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
- size_t count)
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- /* 100ms is the limit for now*/
- else if (data > 100)
- data = 100;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice[1] = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
+ char *p = (char *) page;
+ *var = simple_strtoul(p, &p, 10);
return count;
}
-/* Functions to show and store elv_slice_async value through sysfs */
-ssize_t elv_slice_async_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->elv_slice[0];
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- /* 100ms is the limit for now*/
- else if (data > 100)
- data = 100;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice[0] = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct elv_fq_data *efqd = &e->efqd; \
+ unsigned int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return elv_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
+{ \
+ struct elv_fq_data *efqd = &e->efqd; \
+ unsigned int __data; \
+ int ret = elv_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
void elv_schedule_dispatch(struct request_queue *q)
{
Index: linux18/block/blk-sysfs.c
===================================================================
--- linux18.orig/block/blk-sysfs.c 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/blk-sysfs.c 2009-06-09 13:24:42.000000000 -0400
@@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
.store = queue_iostats_store,
};
-#ifdef CONFIG_ELV_FAIR_QUEUING
-static struct queue_sysfs_entry queue_slice_idle_entry = {
- .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_idle_show,
- .store = elv_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_async_slice_idle_entry = {
- .attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
- .show = elv_async_slice_idle_show,
- .store = elv_async_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_slice_sync_entry = {
- .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_sync_show,
- .store = elv_slice_sync_store,
-};
-
-static struct queue_sysfs_entry queue_slice_async_entry = {
- .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_async_show,
- .store = elv_slice_async_store,
-};
-
-static struct queue_sysfs_entry queue_fairness_entry = {
- .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
- .show = elv_fairness_show,
- .store = elv_fairness_store,
-};
-#endif
-
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
#ifdef CONFIG_GROUP_IOSCHED
@@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
-#ifdef CONFIG_ELV_FAIR_QUEUING
- &queue_slice_idle_entry.attr,
- &queue_async_slice_idle_entry.attr,
- &queue_slice_sync_entry.attr,
- &queue_slice_async_entry.attr,
- &queue_fairness_entry.attr,
-#endif
NULL,
};
Index: linux18/block/cfq-iosched.c
===================================================================
--- linux18.orig/block/cfq-iosched.c 2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/cfq-iosched.c 2009-06-09 13:25:42.000000000 -0400
@@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
CFQ_ATTR(slice_async_rq),
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(async_slice_idle),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(slice_async),
__ATTR_NULL
};
Index: linux18/block/as-iosched.c
===================================================================
--- linux18.orig/block/as-iosched.c 2009-06-09 10:34:58.000000000 -0400
+++ linux18/block/as-iosched.c 2009-06-09 13:27:38.000000000 -0400
@@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] =
AS_ATTR(antic_expire),
AS_ATTR(read_batch_expire),
AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+#endif
__ATTR_NULL
};
Index: linux18/block/deadline-iosched.c
===================================================================
--- linux18.orig/block/deadline-iosched.c 2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/deadline-iosched.c 2009-06-09 13:28:51.000000000 -0400
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
DD_ATTR(writes_starved),
DD_ATTR(front_merges),
DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+#endif
__ATTR_NULL
};
Index: linux18/block/noop-iosched.c
===================================================================
--- linux18.orig/block/noop-iosched.c 2009-06-09 10:34:52.000000000 -0400
+++ linux18/block/noop-iosched.c 2009-06-09 13:31:48.000000000 -0400
@@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct
kfree(nq);
}
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+ __ATTR_NULL
+};
+#endif
+
static struct elevator_type elevator_noop = {
.ops = {
.elevator_merge_req_fn = noop_merged_requests,
@@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
},
#ifdef CONFIG_IOSCHED_NOOP_HIER
.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+ .elevator_attrs = noop_attrs,
#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090609175131.GB13476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
[not found] ` <20090609175131.GB13476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-06-10 1:30 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-10 1:30 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>>> + size_t count)
>>> +{
>>> + struct elv_fq_data *efqd;
>>> + unsigned int data;
>>> + unsigned long flags;
>>> +
>>> + char *p = (char *)name;
>>> +
>>> + data = simple_strtoul(p, &p, 10);
>>> +
>>> + if (data < 0)
>>> + data = 0;
>>> + else if (data > INT_MAX)
>>> + data = INT_MAX;
>> Hi Vivek,
>>
>> data might overflow on 64 bit systems. In addition, since "fairness" is nothing
>> more than an on/off switch, just treat it as one.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>
> Hi Gui,
>
> How about the following patch? Currently it should apply at the end of the
> patch series. If it looks good, I will merge the changes into the higher
> level patches.
This patch looks good to me. Some trivial comments below.
>
> Thanks
> Vivek
>
> o Previously the common layer elevator parameters were appearing as request
> queue parameters in sysfs. But actually these are io scheduler parameters
> in hierarchical mode. Fix it.
>
> o Use macros to define the multiple sysfs C functions that do the same thing.
> Code borrowed from CFQ. Helps reduce the number of lines of code by about 140.
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
... \
> +}
> +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
> +EXPORT_SYMBOL(elv_fairness_show);
> +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
> +EXPORT_SYMBOL(elv_slice_idle_show);
> +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_show);
> +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
> +EXPORT_SYMBOL(elv_slice_sync_show);
> +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
> +EXPORT_SYMBOL(elv_slice_async_show);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
> +{ \
> + struct elv_fq_data *efqd = &e->efqd; \
> + unsigned int __data; \
> + int ret = elv_var_store(&__data, (page), count); \
Since simple_strtoul returns unsigned long, it's better to make __data
be that type.
> + if (__data < (MIN)) \
> + __data = (MIN); \
> + else if (__data > (MAX)) \
> + __data = (MAX); \
> + if (__CONV) \
> + *(__PTR) = msecs_to_jiffies(__data); \
> + else \
> + *(__PTR) = __data; \
> + return ret; \
> +}
> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> +EXPORT_SYMBOL(elv_fairness_store);
> +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
Do we need to set an actual maximum limit rather than UINT_MAX for these entries?
> +EXPORT_SYMBOL(elv_slice_idle_store);
> +STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_store);
> +STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_sync_store);
> +STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_async_store);
> +#undef STORE_FUNCTION
>
> void elv_schedule_dispatch(struct request_queue *q)
> {
> Index: linux18/block/blk-sysfs.c
> ===================================================================
> --- linux18.orig/block/blk-sysfs.c 2009-06-09 10:34:59.000000000 -0400
> +++ linux18/block/blk-sysfs.c 2009-06-09 13:24:42.000000000 -0400
> @@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
> .store = queue_iostats_store,
> };
>
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> -static struct queue_sysfs_entry queue_slice_idle_entry = {
> - .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
> - .show = elv_slice_idle_show,
> - .store = elv_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_async_slice_idle_entry = {
> - .attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
> - .show = elv_async_slice_idle_show,
> - .store = elv_async_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_sync_entry = {
> - .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
> - .show = elv_slice_sync_show,
> - .store = elv_slice_sync_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_async_entry = {
> - .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
> - .show = elv_slice_async_show,
> - .store = elv_slice_async_store,
> -};
> -
> -static struct queue_sysfs_entry queue_fairness_entry = {
> - .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> - .show = elv_fairness_show,
> - .store = elv_fairness_store,
> -};
> -#endif
> -
> static struct attribute *default_attrs[] = {
> &queue_requests_entry.attr,
> #ifdef CONFIG_GROUP_IOSCHED
> @@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
> &queue_nomerges_entry.attr,
> &queue_rq_affinity_entry.attr,
> &queue_iostats_entry.attr,
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> - &queue_slice_idle_entry.attr,
> - &queue_async_slice_idle_entry.attr,
> - &queue_slice_sync_entry.attr,
> - &queue_slice_async_entry.attr,
> - &queue_fairness_entry.attr,
> -#endif
> NULL,
> };
>
> Index: linux18/block/cfq-iosched.c
> ===================================================================
> --- linux18.orig/block/cfq-iosched.c 2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/cfq-iosched.c 2009-06-09 13:25:42.000000000 -0400
> @@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
> CFQ_ATTR(back_seek_max),
> CFQ_ATTR(back_seek_penalty),
> CFQ_ATTR(slice_async_rq),
> + ELV_ATTR(fairness),
> + ELV_ATTR(slice_idle),
> + ELV_ATTR(async_slice_idle),
> + ELV_ATTR(slice_sync),
> + ELV_ATTR(slice_async),
> __ATTR_NULL
> };
>
> Index: linux18/block/as-iosched.c
> ===================================================================
> --- linux18.orig/block/as-iosched.c 2009-06-09 10:34:58.000000000 -0400
> +++ linux18/block/as-iosched.c 2009-06-09 13:27:38.000000000 -0400
> @@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] =
> AS_ATTR(antic_expire),
> AS_ATTR(read_batch_expire),
> AS_ATTR(write_batch_expire),
> +#ifdef CONFIG_IOSCHED_AS_HIER
> + ELV_ATTR(fairness),
> + ELV_ATTR(slice_idle),
> + ELV_ATTR(slice_sync),
> +#endif
> __ATTR_NULL
> };
>
> Index: linux18/block/deadline-iosched.c
> ===================================================================
> --- linux18.orig/block/deadline-iosched.c 2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/deadline-iosched.c 2009-06-09 13:28:51.000000000 -0400
> @@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
> DD_ATTR(writes_starved),
> DD_ATTR(front_merges),
> DD_ATTR(fifo_batch),
> +#ifdef CONFIG_IOSCHED_DEADLINE_HIER
> + ELV_ATTR(fairness),
> + ELV_ATTR(slice_idle),
> + ELV_ATTR(slice_sync),
> +#endif
> __ATTR_NULL
> };
>
> Index: linux18/block/noop-iosched.c
> ===================================================================
> --- linux18.orig/block/noop-iosched.c 2009-06-09 10:34:52.000000000 -0400
> +++ linux18/block/noop-iosched.c 2009-06-09 13:31:48.000000000 -0400
> @@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct
> kfree(nq);
> }
>
> +#ifdef CONFIG_IOSCHED_NOOP_HIER
> +static struct elv_fs_entry noop_attrs[] = {
> + ELV_ATTR(fairness),
> + ELV_ATTR(slice_idle),
> + ELV_ATTR(slice_sync),
> + __ATTR_NULL
> +};
> +#endif
> +
> static struct elevator_type elevator_noop = {
> .ops = {
> .elevator_merge_req_fn = noop_merged_requests,
> @@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
> },
> #ifdef CONFIG_IOSCHED_NOOP_HIER
> .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
> + .elevator_attrs = noop_attrs,
> #endif
> .elevator_name = "noop",
> .elevator_owner = THIS_MODULE,
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
2009-06-10 1:30 ` Gui Jianfeng
@ 2009-06-10 13:26 ` Vivek Goyal
-1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-06-10 13:26 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, Jun 10, 2009 at 09:30:38AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >> ...
> >>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> >>> + size_t count)
> >>> +{
> >>> + struct elv_fq_data *efqd;
> >>> + unsigned int data;
> >>> + unsigned long flags;
> >>> +
> >>> + char *p = (char *)name;
> >>> +
> >>> + data = simple_strtoul(p, &p, 10);
> >>> +
> >>> + if (data < 0)
> >>> + data = 0;
> >>> + else if (data > INT_MAX)
> >>> + data = INT_MAX;
> >> Hi Vivek,
> >>
> >> data might overflow on 64 bit systems. In addition, since "fairness" is nothing
> >> more than a switch, just let it be.
> >>
> >> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> >> ---
> >
> > Hi Gui,
> >
> > How about following patch? Currently this should apply at the end of the
> > patch series. If it looks good, I will merge the changes in higher level
> > patches.
>
> This patch seems good to me. Some trivial comments below.
>
> >
> > Thanks
> > Vivek
> >
> > o Previously common layer elevator parameters were appearing as request
> > queue parameters in sysfs. But actually these are io scheduler parameters
> > in hierarchical mode. Fix it.
> >
> > o Use macros to define multiple sysfs C functions doing the same thing. Code
> > borrowed from CFQ. Helps reduce the number of lines of code by 140.
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ... \
> > +}
> > +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
> > +EXPORT_SYMBOL(elv_fairness_show);
> > +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
> > +EXPORT_SYMBOL(elv_slice_idle_show);
> > +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
> > +EXPORT_SYMBOL(elv_async_slice_idle_show);
> > +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
> > +EXPORT_SYMBOL(elv_slice_sync_show);
> > +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
> > +EXPORT_SYMBOL(elv_slice_async_show);
> > +#undef SHOW_FUNCTION
> > +
> > +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> > +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
> > +{ \
> > + struct elv_fq_data *efqd = &e->efqd; \
> > + unsigned int __data; \
> > + int ret = elv_var_store(&__data, (page), count); \
>
> Since simple_strtoul returns unsigned long, it's better to make __data
> be that type.
>
I just took it from CFQ. BTW, what's the harm here in truncating unsigned
long to unsigned int? For our variables we are not expecting any value
bigger than unsigned int, and if one shows up, truncating it is the expected
behaviour anyway.
> > + if (__data < (MIN)) \
> > + __data = (MIN); \
> > + else if (__data > (MAX)) \
> > + __data = (MAX); \
> > + if (__CONV) \
> > + *(__PTR) = msecs_to_jiffies(__data); \
> > + else \
> > + *(__PTR) = __data; \
> > + return ret; \
> > +}
> > +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> > +EXPORT_SYMBOL(elv_fairness_store);
> > +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
>
> Do we need to set an actual max limitation rather than UINT_MAX for these entries?
Again, these are the same values CFQ was using. Do you have a better upper
limit in mind? Unless there is a strong objection to UINT_MAX, we can
stick to what CFQ has been doing so far.
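For reference, this is roughly what that STORE_FUNCTION instance expands to
for slice_idle (a hand-written sketch, not literal preprocessor output; it
relies on the definitions introduced in the patch above). With MIN=0 and
MAX=UINT_MAX both range checks are no-ops for an unsigned int, so UINT_MAX
effectively means "no upper limit" here:

	ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *page,
				     size_t count)
	{
		struct elv_fq_data *efqd = &e->efqd;
		unsigned int __data;
		int ret = elv_var_store(&__data, page, count);

		if (__data < 0)			/* MIN == 0: never true for unsigned */
			__data = 0;
		else if (__data > UINT_MAX)	/* MAX == UINT_MAX: also never true */
			__data = UINT_MAX;

		/* __CONV == 1: the value is taken in milliseconds */
		efqd->elv_slice_idle = msecs_to_jiffies(__data);
		return ret;
	}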
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
2009-06-10 13:26 ` Vivek Goyal
@ 2009-06-11 1:22 ` Gui Jianfeng
-1 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-11 1:22 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> On Wed, Jun 10, 2009 at 09:30:38AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>> ...
>>>>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>>>>> + size_t count)
>>>>> +{
>>>>> + struct elv_fq_data *efqd;
>>>>> + unsigned int data;
>>>>> + unsigned long flags;
>>>>> +
>>>>> + char *p = (char *)name;
>>>>> +
>>>>> + data = simple_strtoul(p, &p, 10);
>>>>> +
>>>>> + if (data < 0)
>>>>> + data = 0;
>>>>> + else if (data > INT_MAX)
>>>>> + data = INT_MAX;
>>>> Hi Vivek,
>>>>
>>>> data might overflow on 64 bit systems. In addition, since "fairness" is nothing
>>>> more than a switch, just let it be.
>>>>
>>>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>>>> ---
>>> Hi Gui,
>>>
>>> How about following patch? Currently this should apply at the end of the
>>> patch series. If it looks good, I will merge the changes in higher level
>>> patches.
>> This patch seems good to me. Some trivial comments below.
>>
>>> Thanks
>>> Vivek
>>>
>>> o Previously common layer elevator parameters were appearing as request
>>> queue parameters in sysfs. But actually these are io scheduler parameters
>>> in hierarchical mode. Fix it.
>>>
>>> o Use macros to define multiple sysfs C functions doing the same thing. Code
>>> borrowed from CFQ. Helps reduce the number of lines of code by 140.
>>>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>> ... \
>>> +}
>>> +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
>>> +EXPORT_SYMBOL(elv_fairness_show);
>>> +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
>>> +EXPORT_SYMBOL(elv_slice_idle_show);
>>> +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
>>> +EXPORT_SYMBOL(elv_async_slice_idle_show);
>>> +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
>>> +EXPORT_SYMBOL(elv_slice_sync_show);
>>> +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
>>> +EXPORT_SYMBOL(elv_slice_async_show);
>>> +#undef SHOW_FUNCTION
>>> +
>>> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
>>> +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
>>> +{ \
>>> + struct elv_fq_data *efqd = &e->efqd; \
>>> + unsigned int __data; \
>>> + int ret = elv_var_store(&__data, (page), count); \
>> Since simple_strtoul returns unsigned long, it's better to make __data
>> be that type.
>>
>
> I just took it from CFQ. BTW, what's the harm here in truncating unsigned
> long to int? Anyway for our variables we are not expecting any value
> bigger than unsigned int and if it is, we expect to truncate it?
>
>>> + if (__data < (MIN)) \
>>> + __data = (MIN); \
>>> + else if (__data > (MAX)) \
>>> + __data = (MAX); \
>>> + if (__CONV) \
>>> + *(__PTR) = msecs_to_jiffies(__data); \
>>> + else \
>>> + *(__PTR) = __data; \
>>> + return ret; \
>>> +}
>>> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
>>> +EXPORT_SYMBOL(elv_fairness_store);
>>> +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
>> Do we need to set an actual max limitation rather than UINT_MAX for these entries?
>
> Again these are the same values CFQ was using. Do you have a better upper
> limit in mind? Until and unless there is strong objection to UINT_MAX, we
> can stick to what CFQ has been doing so far.
Ok, I don't have a strong opinion about the above points.
>
> Thanks
> Vivek
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
[not found] ` <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-06-09 17:51 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-06-09 17:51 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> > + size_t count)
> > +{
> > + struct elv_fq_data *efqd;
> > + unsigned int data;
> > + unsigned long flags;
> > +
> > + char *p = (char *)name;
> > +
> > + data = simple_strtoul(p, &p, 10);
> > +
> > + if (data < 0)
> > + data = 0;
> > + else if (data > INT_MAX)
> > + data = INT_MAX;
>
> Hi Vivek,
>
> data might overflow on 64 bit systems. In addition, since "fairness" is nothing
> more than a switch, just let it be.
>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
Hi Gui,
How about the following patch? Currently it should apply at the end of the
patch series. If it looks good, I will merge the changes into the higher level
patches.
Thanks
Vivek
o Previously common layer elevator parameters were appearing as request
queue parameters in sysfs. But actually these are io scheduler parameters
in hierarchical mode. Fix it.
o Use macros to define multiple sysfs C functions doing the same thing. Code
borrowed from CFQ. Helps reduce the number of lines of code by 140.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/as-iosched.c | 5
block/blk-sysfs.c | 39 -------
block/cfq-iosched.c | 5
block/deadline-iosched.c | 5
block/elevator-fq.c | 245 +++++++++++------------------------------------
block/elevator-fq.h | 26 ++--
block/noop-iosched.c | 10 +
7 files changed, 97 insertions(+), 238 deletions(-)
Index: linux18/block/elevator-fq.h
===================================================================
--- linux18.orig/block/elevator-fq.h 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.h 2009-06-09 13:35:03.000000000 -0400
@@ -27,6 +27,9 @@ struct io_queue;
#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
/**
* struct bfq_service_tree - per ioprio_class service tree.
* @active: tree for active entities (i.e., those backlogged).
@@ -364,7 +367,7 @@ struct elv_fq_data {
* throughput and focus more on providing fairness for sync
* queues.
*/
- int fairness;
+ unsigned int fairness;
int only_root_group;
};
@@ -650,23 +653,22 @@ static inline struct io_queue *elv_looku
#endif /* GROUP_IOSCHED */
-/* Functions used by blksysfs.c */
-extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_sync_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_async_slice_idle_store(struct request_queue *q,
+extern ssize_t elv_async_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_async_slice_idle_store(struct elevator_queue *e,
const char *name, size_t count);
-extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_async_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *e, const char *name,
size_t count);
-extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
-extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+extern ssize_t elv_fairness_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *e, const char *name,
size_t count);
/* Functions used by elevator.c */
Index: linux18/block/elevator-fq.c
===================================================================
--- linux18.orig/block/elevator-fq.c 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.c 2009-06-09 13:39:48.000000000 -0400
@@ -2618,201 +2618,72 @@ static inline int is_root_group_ioq(stru
return (ioq->entity.sched_data == &efqd->root_group->sched_data);
}
-/* Functions to show and store fairness value through sysfs */
-ssize_t elv_fairness_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->fairness;
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_fairness_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->fairness = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = jiffies_to_msecs(efqd->elv_slice_idle);
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- data = msecs_to_jiffies(data);
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice_idle = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = jiffies_to_msecs(efqd->elv_async_slice_idle);
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_async_slice_idle_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
-
- data = msecs_to_jiffies(data);
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_async_slice_idle = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
-
-/* Functions to show and store elv_slice_sync value through sysfs */
-ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->elv_slice[1];
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
+ return sprintf(page, "%d\n", var);
}
-ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
- size_t count)
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- /* 100ms is the limit for now*/
- else if (data > 100)
- data = 100;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice[1] = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
+ char *p = (char *) page;
+ *var = simple_strtoul(p, &p, 10);
return count;
}
-/* Functions to show and store elv_slice_async value through sysfs */
-ssize_t elv_slice_async_show(struct request_queue *q, char *name)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- data = efqd->elv_slice[0];
- spin_unlock_irqrestore(q->queue_lock, flags);
- return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
- size_t count)
-{
- struct elv_fq_data *efqd;
- unsigned int data;
- unsigned long flags;
-
- char *p = (char *)name;
-
- data = simple_strtoul(p, &p, 10);
-
- if (data < 0)
- data = 0;
- /* 100ms is the limit for now*/
- else if (data > 100)
- data = 100;
-
- spin_lock_irqsave(q->queue_lock, flags);
- efqd = &q->elevator->efqd;
- efqd->elv_slice[0] = data;
- spin_unlock_irqrestore(q->queue_lock, flags);
-
- return count;
-}
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct elv_fq_data *efqd = &e->efqd; \
+ unsigned int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return elv_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
+{ \
+ struct elv_fq_data *efqd = &e->efqd; \
+ unsigned int __data; \
+ int ret = elv_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
void elv_schedule_dispatch(struct request_queue *q)
{
Index: linux18/block/blk-sysfs.c
===================================================================
--- linux18.orig/block/blk-sysfs.c 2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/blk-sysfs.c 2009-06-09 13:24:42.000000000 -0400
@@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
.store = queue_iostats_store,
};
-#ifdef CONFIG_ELV_FAIR_QUEUING
-static struct queue_sysfs_entry queue_slice_idle_entry = {
- .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_idle_show,
- .store = elv_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_async_slice_idle_entry = {
- .attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
- .show = elv_async_slice_idle_show,
- .store = elv_async_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_slice_sync_entry = {
- .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_sync_show,
- .store = elv_slice_sync_store,
-};
-
-static struct queue_sysfs_entry queue_slice_async_entry = {
- .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
- .show = elv_slice_async_show,
- .store = elv_slice_async_store,
-};
-
-static struct queue_sysfs_entry queue_fairness_entry = {
- .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
- .show = elv_fairness_show,
- .store = elv_fairness_store,
-};
-#endif
-
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
#ifdef CONFIG_GROUP_IOSCHED
@@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
-#ifdef CONFIG_ELV_FAIR_QUEUING
- &queue_slice_idle_entry.attr,
- &queue_async_slice_idle_entry.attr,
- &queue_slice_sync_entry.attr,
- &queue_slice_async_entry.attr,
- &queue_fairness_entry.attr,
-#endif
NULL,
};
Index: linux18/block/cfq-iosched.c
===================================================================
--- linux18.orig/block/cfq-iosched.c 2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/cfq-iosched.c 2009-06-09 13:25:42.000000000 -0400
@@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
CFQ_ATTR(slice_async_rq),
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(async_slice_idle),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(slice_async),
__ATTR_NULL
};
Index: linux18/block/as-iosched.c
===================================================================
--- linux18.orig/block/as-iosched.c 2009-06-09 10:34:58.000000000 -0400
+++ linux18/block/as-iosched.c 2009-06-09 13:27:38.000000000 -0400
@@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] =
AS_ATTR(antic_expire),
AS_ATTR(read_batch_expire),
AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+#endif
__ATTR_NULL
};
Index: linux18/block/deadline-iosched.c
===================================================================
--- linux18.orig/block/deadline-iosched.c 2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/deadline-iosched.c 2009-06-09 13:28:51.000000000 -0400
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
DD_ATTR(writes_starved),
DD_ATTR(front_merges),
DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+#endif
__ATTR_NULL
};
Index: linux18/block/noop-iosched.c
===================================================================
--- linux18.orig/block/noop-iosched.c 2009-06-09 10:34:52.000000000 -0400
+++ linux18/block/noop-iosched.c 2009-06-09 13:31:48.000000000 -0400
@@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct
kfree(nq);
}
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+ __ATTR_NULL
+};
+#endif
+
static struct elevator_type elevator_noop = {
.ops = {
.elevator_merge_req_fn = noop_merged_requests,
@@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
},
#ifdef CONFIG_IOSCHED_NOOP_HIER
.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+ .elevator_attrs = noop_attrs,
#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
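For reference, this is roughly what one SHOW_FUNCTION/STORE_FUNCTION pair from the
elevator-fq.c hunk above expands to after the preprocessor (a hand-simplified sketch
with the constant __CONV/MIN/MAX arguments folded in; it is not part of the patch):

ssize_t elv_slice_idle_show(struct elevator_queue *e, char *page)
{
	struct elv_fq_data *efqd = &e->efqd;
	unsigned int data = efqd->elv_slice_idle;

	data = jiffies_to_msecs(data);			/* __CONV == 1 */
	return elv_var_show(data, page);
}

ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *page,
				size_t count)
{
	struct elv_fq_data *efqd = &e->efqd;
	unsigned int data;
	int ret = elv_var_store(&data, page, count);

	/* MIN is 0 and MAX is UINT_MAX here, so no clamping is needed */
	efqd->elv_slice_idle = msecs_to_jiffies(data);	/* __CONV == 1 */
	return ret;
}

The tunables are kept in jiffies internally and exported in milliseconds, which is
why __CONV is 1 for the slice and idle tunables and 0 only for the on/off
"fairness" switch.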
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
[not found] ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-13 15:00 ` Vivek Goyal
2009-06-09 7:56 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:00 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On Tue, May 05, 2009 at 03:58:35PM -0400, Vivek Goyal wrote:
> o When a sync queue expires, in many cases it might be empty and then
> it will be deleted from the active tree. This will lead to a scenario
> where out of two competing queues, only one is on the tree and when a
> new queue is selected, vtime jump takes place and we don't see services
> provided in proportion to weight.
>
> o In general this is a fundamental problem with fairness of sync queues
> where queues are not continuously backlogged. Looks like idling is the
> only solution to make sure such queues can get some decent amount
> of disk bandwidth in the face of competition from continuously backlogged
> queues. But excessive idling has the potential to reduce performance on SSDs
> and disks with command queuing.
>
> o This patch experiments with waiting for the next request to come before a
> queue is expired after it has consumed its time slice. This can ensure
> more accurate fairness numbers in some cases.
>
> o Introduced a tunable "fairness". If set, io-controller will put more
> focus on getting fairness right than getting throughput right.
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
Following is a fix which should go here. This patch helps me get much
better fairness numbers for sync queues.
o Fix a window where a queue can be expired without doing a busy wait for the
next request. This fix allows better fairness numbers for sync queues.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/elevator-fq.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c 2009-05-13 10:55:44.000000000 -0400
+++ linux14/block/elevator-fq.c 2009-05-13 10:55:50.000000000 -0400
@@ -3368,8 +3368,22 @@ void *elv_fq_select_ioq(struct request_q
/*
* The active queue has run out of time, expire it and select new.
*/
- if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
- goto expire;
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+ /*
+ * Queue has used up its slice. Wait busy is not on otherwise
+ * we wouldn't have been here. There is a chance that after
+ * slice expiry no request from the queue completed hence
+ * wait busy timer could not be turned on. If that's the case
+ * don't expire the queue yet. Next request completion from
+ * the queue will arm the wait busy timer.
+ */
+ if (efqd->fairness && !ioq->nr_queued
+ && elv_ioq_nr_dispatched(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ } else
+ goto expire;
+ }
/*
* If we have a RT cfqq waiting, then we pre-empt the current non-rt
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
[not found] ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-13 15:00 ` Vivek Goyal
@ 2009-06-09 7:56 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-09 7:56 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
...
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> + size_t count)
> +{
> + struct elv_fq_data *efqd;
> + unsigned int data;
> + unsigned long flags;
> +
> + char *p = (char *)name;
> +
> + data = simple_strtoul(p, &p, 10);
> +
> + if (data < 0)
> + data = 0;
> + else if (data > INT_MAX)
> + data = INT_MAX;
Hi Vivek,
data might overflow on 64-bit systems. In addition, since "fairness" is nothing
more than an on/off switch, just treat any non-zero value as 1.
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/elevator-fq.c | 10 +++++-----
block/elevator-fq.h | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 655162b..42d4279 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2605,7 +2605,7 @@ static inline int is_root_group_ioq(struct request_queue *q,
ssize_t elv_fairness_show(struct request_queue *q, char *name)
{
struct elv_fq_data *efqd;
- unsigned int data;
+ unsigned long data;
unsigned long flags;
spin_lock_irqsave(q->queue_lock, flags);
@@ -2619,17 +2619,17 @@ ssize_t elv_fairness_store(struct request_queue *q, const char *name,
size_t count)
{
struct elv_fq_data *efqd;
- unsigned int data;
+ unsigned long data;
unsigned long flags;
char *p = (char *)name;
data = simple_strtoul(p, &p, 10);
- if (data < 0)
+ if (!data)
data = 0;
- else if (data > INT_MAX)
- data = INT_MAX;
+ else
+ data = 1;
spin_lock_irqsave(q->queue_lock, flags);
efqd = &q->elevator->efqd;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b2bb11a..4fe843a 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -359,7 +359,7 @@ struct elv_fq_data {
* throughput and focus more on providing fairness for sync
* queues.
*/
- int fairness;
+ unsigned long fairness;
};
extern int elv_slice_idle;
--
1.5.4.rc3
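The same normalization can also be written more compactly; an equivalent one-line
alternative (illustrative only, the patch above keeps the explicit if/else) would be:

	data = !!data;	/* any non-zero input enables "fairness" */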
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 09/18] io-controller: Separate out queue and data
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (14 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (21 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o So far noop, deadline and AS had one common structure called *_data, which
contained both the queue information where requests are queued and the
common data used for scheduling. This patch breaks that common structure
down into two parts, *_queue and *_data. This is along the lines of cfq,
where all the requests are queued in the queue and the common data and
tunables are part of the data.
o It does not change the functionality, but this re-organization helps once
noop, deadline and AS are changed to use hierarchical fair queuing.
o It looks like the queue_empty function is not required; we can check
q->nr_sorted in the elevator layer to see whether the io scheduler queues
are empty.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/as-iosched.c | 208 ++++++++++++++++++++++++++--------------------
block/deadline-iosched.c | 117 ++++++++++++++++----------
block/elevator.c | 111 +++++++++++++++++++++----
block/noop-iosched.c | 59 ++++++-------
include/linux/elevator.h | 8 ++-
5 files changed, 319 insertions(+), 184 deletions(-)
diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
* or timed out */
};
-struct as_data {
- /*
- * run time data
- */
-
- struct request_queue *q; /* the "owner" queue */
-
+struct as_queue {
/*
* requests (as_rq s) are present on both sort_list and fifo_list
*/
@@ -90,6 +84,14 @@ struct as_data {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+ unsigned long last_check_fifo[2];
+ int write_batch_count; /* max # of reqs in a write batch */
+ int current_write_count; /* how many requests left this batch */
+ int write_batch_idled; /* has the write batch gone idle? */
+};
+
+struct as_data {
+ struct request_queue *q; /* the "owner" queue */
sector_t last_sector[2]; /* last SYNC & ASYNC sectors */
unsigned long exit_prob; /* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
sector_t new_seek_mean;
unsigned long current_batch_expires;
- unsigned long last_check_fifo[2];
int changed_batch; /* 1: waiting for old batch to end */
int new_batch; /* 1: waiting on first read complete */
- int batch_data_dir; /* current batch SYNC / ASYNC */
- int write_batch_count; /* max # of reqs in a write batch */
- int current_write_count; /* how many requests left this batch */
- int write_batch_idled; /* has the write batch gone idle? */
enum anticipation_status antic_status;
unsigned long antic_start; /* jiffies: when it started */
struct timer_list antic_timer; /* anticipatory scheduling timer */
- struct work_struct antic_work; /* Deferred unplugging */
+ struct work_struct antic_work; /* Deferred unplugging */
struct io_context *io_context; /* Identify the expected process */
int ioc_finished; /* IO associated with io_context is finished */
int nr_dispatched;
+ int batch_data_dir; /* current batch SYNC / ASYNC */
/*
* settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
/*
* rb tree support functions
*/
-#define RQ_RB_ROOT(ad, rq) (&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq) (&(asq)->sort_list[rq_is_sync((rq))])
static void as_add_rq_rb(struct as_data *ad, struct request *rq)
{
struct request *alias;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
- while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+ while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
as_move_to_dispatch(ad, alias);
as_antic_stop(ad);
}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
{
- elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+ elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
}
/*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
* what request to process next. Anticipation works on top of this.
*/
static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
{
struct rb_node *rbnext = rb_next(&last->rb_node);
struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
else {
const int data_dir = rq_is_sync(last);
- rbnext = rb_first(&ad->sort_list[data_dir]);
+ rbnext = rb_first(&asq->sort_list[data_dir]);
if (rbnext && rbnext != &last->rb_node)
next = rb_entry_rq(rbnext);
}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
static void as_update_rq(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
/* keep the next_rq cache up to date */
- ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+ asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
/*
* have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
{
unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
long write_time;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
write_time = (jiffies - ad->current_batch_expires) + batch;
if (write_time < 0)
write_time = 0;
- if (write_time > batch && !ad->write_batch_idled) {
+ if (write_time > batch && !asq->write_batch_idled) {
if (write_time > batch * 3)
- ad->write_batch_count /= 2;
+ asq->write_batch_count /= 2;
else
- ad->write_batch_count--;
- } else if (write_time < batch && ad->current_write_count == 0) {
+ asq->write_batch_count--;
+ } else if (write_time < batch && asq->current_write_count == 0) {
if (batch > write_time * 3)
- ad->write_batch_count *= 2;
+ asq->write_batch_count *= 2;
else
- ad->write_batch_count++;
+ asq->write_batch_count++;
}
- if (ad->write_batch_count < 1)
- ad->write_batch_count = 1;
+ if (asq->write_batch_count < 1)
+ asq->write_batch_count = 1;
}
/*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
const int data_dir = rq_is_sync(rq);
struct as_data *ad = q->elevator->elevator_data;
struct io_context *ioc;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
* Update the "next_rq" cache if we are about to remove its
* entry
*/
- if (ad->next_rq[data_dir] == rq)
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ if (asq->next_rq[data_dir] == rq)
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
rq_fifo_clear(rq);
as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
*
* See as_antic_expired comment.
*/
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
{
struct request *rq;
long delta_jif;
- delta_jif = jiffies - ad->last_check_fifo[adir];
+ delta_jif = jiffies - asq->last_check_fifo[adir];
if (unlikely(delta_jif < 0))
delta_jif = -delta_jif;
if (delta_jif < ad->fifo_expire[adir])
return 0;
- ad->last_check_fifo[adir] = jiffies;
+ asq->last_check_fifo[adir] = jiffies;
- if (list_empty(&ad->fifo_list[adir]))
+ if (list_empty(&asq->fifo_list[adir]))
return 0;
- rq = rq_entry_fifo(ad->fifo_list[adir].next);
+ rq = rq_entry_fifo(asq->fifo_list[adir].next);
return time_after(jiffies, rq_fifo_time(rq));
}
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
* as_batch_expired returns true if the current batch has expired. A batch
* is a set of reads or a set of writes.
*/
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
{
if (ad->changed_batch || ad->new_batch)
return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
return time_after(jiffies, ad->current_batch_expires);
return time_after(jiffies, ad->current_batch_expires)
- || ad->current_write_count == 0;
+ || asq->current_write_count == 0;
}
/*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
ad->io_context = NULL;
}
- if (ad->current_write_count != 0)
- ad->current_write_count--;
+ if (asq->current_write_count != 0)
+ asq->current_write_count--;
}
ad->ioc_finished = 0;
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
/*
* take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
static int as_dispatch_request(struct request_queue *q, int force)
{
struct as_data *ad = q->elevator->elevator_data;
- const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
- const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
struct request *rq;
+ struct as_queue *asq = elv_select_sched_queue(q, force);
+ int reads, writes;
+
+ if (!asq)
+ return 0;
+
+ reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+ writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
if (unlikely(force)) {
/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 0;
ad->new_batch = 0;
- while (ad->next_rq[BLK_RW_SYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+ while (asq->next_rq[BLK_RW_SYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
- while (ad->next_rq[BLK_RW_ASYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+ while (asq->next_rq[BLK_RW_ASYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
return dispatched;
}
/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
- if (ad->current_write_count == 0 || !writes)
- ad->write_batch_idled = 1;
+ if (asq->current_write_count == 0 || !writes)
+ asq->write_batch_idled = 1;
}
if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
|| ad->changed_batch)
return 0;
- if (!(reads && writes && as_batch_expired(ad))) {
+ if (!(reads && writes && as_batch_expired(ad, asq))) {
/*
* batch is still running or no reads or no writes
*/
- rq = ad->next_rq[ad->batch_data_dir];
+ rq = asq->next_rq[ad->batch_data_dir];
if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
- if (as_fifo_expired(ad, BLK_RW_SYNC))
+ if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
goto fifo_expired;
if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
if (writes && ad->batch_data_dir == BLK_RW_SYNC)
/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 1;
}
ad->batch_data_dir = BLK_RW_SYNC;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
- ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+ asq->last_check_fifo[ad->batch_data_dir] = jiffies;
goto dispatch_request;
}
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
if (ad->batch_data_dir == BLK_RW_SYNC) {
ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
ad->new_batch = 0;
}
ad->batch_data_dir = BLK_RW_ASYNC;
- ad->current_write_count = ad->write_batch_count;
- ad->write_batch_idled = 0;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->current_write_count = asq->write_batch_count;
+ asq->write_batch_idled = 0;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
goto dispatch_request;
}
@@ -1150,9 +1162,9 @@ dispatch_request:
* If a request has expired, service it.
*/
- if (as_fifo_expired(ad, ad->batch_data_dir)) {
+ if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
fifo_expired:
- rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+ rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
}
if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
int data_dir;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
RQ_SET_STATE(rq, AS_RQ_NEW);
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
as_update_rq(ad, rq); /* keep state machine up to date */
RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
}
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
- struct as_data *ad = q->elevator->elevator_data;
-
- return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
- && list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
static int
as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
- struct as_data *ad = q->elevator->elevator_data;
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
+ struct as_queue *asq = elv_get_sched_queue_current(q);
+
+ if (!asq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
*/
- __rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+ __rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
if (__rq && elv_rq_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
return ret;
}
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
+{
+ struct as_queue *asq;
+ struct as_data *ad = eq->elevator_data;
+
+ asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+ if (asq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+ asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+ asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+ if (ad)
+ asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+ else
+ asq->write_batch_count = default_write_batch_expire / 10;
+
+ if (asq->write_batch_count < 2)
+ asq->write_batch_count = 2;
+out:
+ return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+ struct as_queue *asq = sched_queue;
+
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+ kfree(asq);
+}
+
static void as_exit_queue(struct elevator_queue *e)
{
struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
del_timer_sync(&ad->antic_timer);
cancel_work_sync(&ad->antic_work);
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
put_io_context(ad->io_context);
kfree(ad);
}
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
init_timer(&ad->antic_timer);
INIT_WORK(&ad->antic_work, as_work_handler);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
- ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
- ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
- ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
- if (ad->write_batch_count < 2)
- ad->write_batch_count = 2;
return ad;
}
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
.elevator_add_req_fn = as_add_request,
.elevator_activate_req_fn = as_activate_request,
.elevator_deactivate_req_fn = as_deactivate_request,
- .elevator_queue_empty_fn = as_queue_empty,
.elevator_completed_req_fn = as_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
.elevator_init_fn = as_init_queue,
.elevator_exit_fn = as_exit_queue,
.trim = as_trim,
+ .elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+ .elevator_free_sched_queue_fn = as_free_as_queue,
},
.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
-struct deadline_data {
- /*
- * run time data
- */
-
+struct deadline_queue {
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
- struct rb_root sort_list[2];
+ struct rb_root sort_list[2];
struct list_head fifo_list[2];
-
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
- sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
+};
+struct deadline_data {
+ struct request_queue *q;
+ sector_t last_sector; /* head position */
/*
* settings that change how the i/o scheduler behaves
*/
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
static inline struct rb_root *
deadline_rb_root(struct deadline_data *dd, struct request *rq)
{
- return &dd->sort_list[rq_data_dir(rq)];
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+ return &dq->sort_list[rq_data_dir(rq)];
}
/*
@@ -87,9 +87,10 @@ static inline void
deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- if (dd->next_rq[data_dir] == rq)
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ if (dq->next_rq[data_dir] == rq)
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
elv_rb_del(deadline_rb_root(dd, rq), rq);
}
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(q, rq);
deadline_add_rq_rb(dd, rq);
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
}
/*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
struct deadline_data *dd = q->elevator->elevator_data;
struct request *__rq;
int ret;
+ struct deadline_queue *dq;
+
+ dq = elv_get_sched_queue_current(q);
+ if (!dq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
if (dd->front_merges) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
- __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ __rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
if (__rq) {
BUG_ON(sector != __rq->sector);
@@ -207,10 +214,11 @@ static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- dd->next_rq[READ] = NULL;
- dd->next_rq[WRITE] = NULL;
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ dq->next_rq[READ] = NULL;
+ dq->next_rq[WRITE] = NULL;
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
dd->last_sector = rq_end_sector(rq);
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
*/
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
{
- struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+ struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
/*
* rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int reads = !list_empty(&dd->fifo_list[READ]);
- const int writes = !list_empty(&dd->fifo_list[WRITE]);
+ struct deadline_queue *dq = elv_select_sched_queue(q, force);
+ int reads, writes;
struct request *rq;
int data_dir;
+ if (!dq)
+ return 0;
+
+ reads = !list_empty(&dq->fifo_list[READ]);
+ writes = !list_empty(&dq->fifo_list[WRITE]);
+
/*
* batches are currently reads XOR writes
*/
- if (dd->next_rq[WRITE])
- rq = dd->next_rq[WRITE];
+ if (dq->next_rq[WRITE])
+ rq = dq->next_rq[WRITE];
else
- rq = dd->next_rq[READ];
+ rq = dq->next_rq[READ];
- if (rq && dd->batching < dd->fifo_batch)
+ if (rq && dq->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
goto dispatch_request;
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
- if (writes && (dd->starved++ >= dd->writes_starved))
+ if (writes && (dq->starved++ >= dd->writes_starved))
goto dispatch_writes;
data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
- dd->starved = 0;
+ dq->starved = 0;
data_dir = WRITE;
@@ -299,48 +313,62 @@ dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
- if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
- rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
- rq = dd->next_rq[data_dir];
+ rq = dq->next_rq[data_dir];
}
- dd->batching = 0;
+ dq->batching = 0;
dispatch_request:
/*
* rq is the selected appropriate request.
*/
- dd->batching++;
+ dq->batching++;
deadline_move_request(dd, rq);
return 1;
}
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct deadline_data *dd = q->elevator->elevator_data;
+ struct deadline_queue *dq;
- return list_empty(&dd->fifo_list[WRITE])
- && list_empty(&dd->fifo_list[READ]);
+ dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+ if (dq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&dq->fifo_list[READ]);
+ INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+ dq->sort_list[READ] = RB_ROOT;
+ dq->sort_list[WRITE] = RB_ROOT;
+out:
+ return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+ void *sched_queue)
+{
+ struct deadline_queue *dq = sched_queue;
+
+ kfree(dq);
}
static void deadline_exit_queue(struct elevator_queue *e)
{
struct deadline_data *dd = e->elevator_data;
- BUG_ON(!list_empty(&dd->fifo_list[READ]));
- BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
kfree(dd);
}
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
if (!dd)
return NULL;
- INIT_LIST_HEAD(&dd->fifo_list[READ]);
- INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
- dd->sort_list[READ] = RB_ROOT;
- dd->sort_list[WRITE] = RB_ROOT;
+ dd->q = q;
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
.elevator_merge_req_fn = deadline_merged_requests,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn = deadline_add_request,
- .elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_init_fn = deadline_init_queue,
.elevator_exit_fn = deadline_exit_queue,
+ .elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+ .elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
-
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
return e;
}
-static void *elevator_init_queue(struct request_queue *q,
- struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+ struct elevator_queue *eq)
{
- return eq->ops->elevator_init_fn(q);
+ void *data = NULL;
+
+ if (eq->ops->elevator_init_fn) {
+ data = eq->ops->elevator_init_fn(q);
+ if (data)
+ return data;
+ else
+ return ERR_PTR(-ENOMEM);
+ }
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+ return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+ void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+ if (!sched_queue)
+ return;
+ eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+ struct elevator_queue *eq)
+{
+ void *sched_queue = NULL;
+
+ if (eq->ops->elevator_alloc_sched_queue_fn) {
+ sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+ GFP_KERNEL);
+ if (!sched_queue)
+ return ERR_PTR(-ENOMEM);
+
+ }
+
+ return sched_queue;
}
static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
- void *data)
+ void *data, void *sched_queue)
{
q->elevator = eq;
eq->elevator_data = data;
+ eq->sched_queue = sched_queue;
}
static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
struct elevator_type *e = NULL;
struct elevator_queue *eq;
int ret = 0;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
INIT_LIST_HEAD(&q->queue_head);
q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
if (!eq)
return -ENOMEM;
- data = elevator_init_queue(q, eq);
- if (!data) {
+ data = elevator_init_data(q, eq);
+
+ if (IS_ERR(data)) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, eq);
+
+ if (IS_ERR(sched_queue)) {
kobject_put(&eq->kobj);
return -ENOMEM;
}
- elevator_attach(q, eq, data);
+ elevator_attach(q, eq, data, sched_queue);
return ret;
}
EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elevator_free_sched_queue(e, e->sched_queue);
elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old_elevator, *e;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
/*
* Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (!e)
return 0;
- data = elevator_init_queue(q, e);
- if (!data) {
+ data = elevator_init_data(q, e);
+
+ if (IS_ERR(data)) {
kobject_put(&e->kobj);
- return 0;
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, e);
+
+ if (IS_ERR(sched_queue)) {
+ kobject_put(&e->kobj);
+ return -ENOMEM;
}
/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
/*
* attach and start new elevator
*/
- elevator_attach(q, e, data);
+ elevator_attach(q, e, data, sched_queue);
spin_unlock_irq(q->queue_lock);
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
}
EXPORT_SYMBOL(elv_rb_latter_request);
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
{
- return ioq_sched_queue(rq_ioq(rq));
+ /*
+ * io scheduler is not using fair queuing. Return sched_queue
+ * pointer stored in elevator_queue. It will be null if io
+ * scheduler never stored anything there to begin with (cfq)
+ */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ /*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+ * has passed a non null rq, retrieve sched_queue pointer from
+ * there. */
+ if (rq)
+ return ioq_sched_queue(rq_ioq(rq));
+
+ return NULL;
}
EXPORT_SYMBOL(elv_get_sched_queue);
/* Select an ioscheduler queue to dispatch request from. */
void *elv_select_sched_queue(struct request_queue *q, int force)
{
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
return ioq_sched_queue(elv_fq_select_ioq(q, force));
}
EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+ return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
#include <linux/module.h>
#include <linux/init.h>
-struct noop_data {
+struct noop_queue {
struct list_head queue;
};
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
static int noop_dispatch(struct request_queue *q, int force)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_select_sched_queue(q, force);
- if (!list_empty(&nd->queue)) {
+ if (!nq)
+ return 0;
+
+ if (!list_empty(&nq->queue)) {
struct request *rq;
- rq = list_entry(nd->queue.next, struct request, queuelist);
+ rq = list_entry(nq->queue.next, struct request, queuelist);
list_del_init(&rq->queuelist);
elv_dispatch_sort(q, rq);
return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
static void noop_add_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
- struct noop_data *nd = q->elevator->elevator_data;
-
- return list_empty(&nd->queue);
+ list_add_tail(&rq->queuelist, &nq->queue);
}
static struct request *
noop_former_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.prev == &nd->queue)
+ if (rq->queuelist.prev == &nq->queue)
return NULL;
return list_entry(rq->queuelist.prev, struct request, queuelist);
}
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
static struct request *
noop_latter_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.next == &nd->queue)
+ if (rq->queuelist.next == &nq->queue)
return NULL;
return list_entry(rq->queuelist.next, struct request, queuelist);
}
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct noop_data *nd;
+ struct noop_queue *nq;
- nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
- if (!nd)
- return NULL;
- INIT_LIST_HEAD(&nd->queue);
- return nd;
+ nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+ if (nq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&nq->queue);
+out:
+ return nq;
}
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
{
- struct noop_data *nd = e->elevator_data;
+ struct noop_queue *nq = sched_queue;
- BUG_ON(!list_empty(&nd->queue));
- kfree(nd);
+ kfree(nq);
}
static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
.elevator_merge_req_fn = noop_merged_requests,
.elevator_dispatch_fn = noop_dispatch,
.elevator_add_req_fn = noop_add_request,
- .elevator_queue_empty_fn = noop_queue_empty,
.elevator_former_req_fn = noop_former_request,
.elevator_latter_req_fn = noop_latter_request,
- .elevator_init_fn = noop_init_queue,
- .elevator_exit_fn = noop_exit_queue,
+ .elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
+ .elevator_free_sched_queue_fn = noop_free_noop_queue,
},
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
@@ -112,6 +114,7 @@ struct elevator_queue
{
struct elevator_ops *ops;
void *elevator_data;
+ void *sched_queue;
struct kobject kobj;
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
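To illustrate the new contract, below is a minimal sketch of an io scheduler wired
up to the new hooks. It is modeled on the noop changes above; the "foo" names are
made up for illustration and are not part of the patch.

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/module.h>
#include <linux/slab.h>

struct foo_queue {
	struct list_head queue;		/* requests for this sched_queue */
};

/* Per-queue allocation now goes through the new alloc hook */
static void *foo_alloc_foo_queue(struct request_queue *q,
			struct elevator_queue *eq, gfp_t gfp_mask)
{
	struct foo_queue *fq;

	fq = kmalloc_node(sizeof(*fq), gfp_mask | __GFP_ZERO, q->node);
	if (fq)
		INIT_LIST_HEAD(&fq->queue);
	return fq;			/* NULL is treated as -ENOMEM */
}

static void foo_free_foo_queue(struct elevator_queue *e, void *sched_queue)
{
	kfree(sched_queue);
}

static struct elevator_type elevator_foo = {
	.ops = {
		/*
		 * dispatch/add/merge hooks would go here; they look up
		 * their foo_queue with elv_get_sched_queue() or
		 * elv_select_sched_queue(), as noop/deadline/AS do above.
		 */
		.elevator_alloc_sched_queue_fn	= foo_alloc_foo_queue,
		.elevator_free_sched_queue_fn	= foo_free_foo_queue,
		/*
		 * elevator_init_fn/elevator_exit_fn are only needed if the
		 * scheduler still keeps global *_data (deadline and AS do,
		 * noop no longer does).
		 */
	},
	.elevator_name	= "foo",
	.elevator_owner	= THIS_MODULE,
};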
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 09/18] io-controller: Separate out queue and data
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (15 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 10/18] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
` (20 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o So far noop, deadline and AS had one common structure called *_data, which
contained both the queue information where requests are queued and the
common data used for scheduling. This patch breaks that common structure
down into two parts, *_queue and *_data. This is along the lines of cfq,
where all the requests are queued in the queue and the common data and
tunables are part of the data.
o It does not change the functionality, but this re-organization helps once
noop, deadline and AS are changed to use hierarchical fair queuing.
o It looks like the queue_empty function is not required; we can check
q->nr_sorted in the elevator layer to see whether the io scheduler queues
are empty.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/as-iosched.c | 208 ++++++++++++++++++++++++++--------------------
block/deadline-iosched.c | 117 ++++++++++++++++----------
block/elevator.c | 111 +++++++++++++++++++++----
block/noop-iosched.c | 59 ++++++-------
include/linux/elevator.h | 8 ++-
5 files changed, 319 insertions(+), 184 deletions(-)
diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
* or timed out */
};
-struct as_data {
- /*
- * run time data
- */
-
- struct request_queue *q; /* the "owner" queue */
-
+struct as_queue {
/*
* requests (as_rq s) are present on both sort_list and fifo_list
*/
@@ -90,6 +84,14 @@ struct as_data {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+ unsigned long last_check_fifo[2];
+ int write_batch_count; /* max # of reqs in a write batch */
+ int current_write_count; /* how many requests left this batch */
+ int write_batch_idled; /* has the write batch gone idle? */
+};
+
+struct as_data {
+ struct request_queue *q; /* the "owner" queue */
sector_t last_sector[2]; /* last SYNC & ASYNC sectors */
unsigned long exit_prob; /* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
sector_t new_seek_mean;
unsigned long current_batch_expires;
- unsigned long last_check_fifo[2];
int changed_batch; /* 1: waiting for old batch to end */
int new_batch; /* 1: waiting on first read complete */
- int batch_data_dir; /* current batch SYNC / ASYNC */
- int write_batch_count; /* max # of reqs in a write batch */
- int current_write_count; /* how many requests left this batch */
- int write_batch_idled; /* has the write batch gone idle? */
enum anticipation_status antic_status;
unsigned long antic_start; /* jiffies: when it started */
struct timer_list antic_timer; /* anticipatory scheduling timer */
- struct work_struct antic_work; /* Deferred unplugging */
+ struct work_struct antic_work; /* Deferred unplugging */
struct io_context *io_context; /* Identify the expected process */
int ioc_finished; /* IO associated with io_context is finished */
int nr_dispatched;
+ int batch_data_dir; /* current batch SYNC / ASYNC */
/*
* settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
/*
* rb tree support functions
*/
-#define RQ_RB_ROOT(ad, rq) (&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq) (&(asq)->sort_list[rq_is_sync((rq))])
static void as_add_rq_rb(struct as_data *ad, struct request *rq)
{
struct request *alias;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
- while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+ while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
as_move_to_dispatch(ad, alias);
as_antic_stop(ad);
}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
{
- elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+ elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
}
/*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
* what request to process next. Anticipation works on top of this.
*/
static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
{
struct rb_node *rbnext = rb_next(&last->rb_node);
struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
else {
const int data_dir = rq_is_sync(last);
- rbnext = rb_first(&ad->sort_list[data_dir]);
+ rbnext = rb_first(&asq->sort_list[data_dir]);
if (rbnext && rbnext != &last->rb_node)
next = rb_entry_rq(rbnext);
}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
static void as_update_rq(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
/* keep the next_rq cache up to date */
- ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+ asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
/*
* have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
{
unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
long write_time;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
write_time = (jiffies - ad->current_batch_expires) + batch;
if (write_time < 0)
write_time = 0;
- if (write_time > batch && !ad->write_batch_idled) {
+ if (write_time > batch && !asq->write_batch_idled) {
if (write_time > batch * 3)
- ad->write_batch_count /= 2;
+ asq->write_batch_count /= 2;
else
- ad->write_batch_count--;
- } else if (write_time < batch && ad->current_write_count == 0) {
+ asq->write_batch_count--;
+ } else if (write_time < batch && asq->current_write_count == 0) {
if (batch > write_time * 3)
- ad->write_batch_count *= 2;
+ asq->write_batch_count *= 2;
else
- ad->write_batch_count++;
+ asq->write_batch_count++;
}
- if (ad->write_batch_count < 1)
- ad->write_batch_count = 1;
+ if (asq->write_batch_count < 1)
+ asq->write_batch_count = 1;
}
/*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
const int data_dir = rq_is_sync(rq);
struct as_data *ad = q->elevator->elevator_data;
struct io_context *ioc;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
* Update the "next_rq" cache if we are about to remove its
* entry
*/
- if (ad->next_rq[data_dir] == rq)
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ if (asq->next_rq[data_dir] == rq)
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
rq_fifo_clear(rq);
as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
*
* See as_antic_expired comment.
*/
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
{
struct request *rq;
long delta_jif;
- delta_jif = jiffies - ad->last_check_fifo[adir];
+ delta_jif = jiffies - asq->last_check_fifo[adir];
if (unlikely(delta_jif < 0))
delta_jif = -delta_jif;
if (delta_jif < ad->fifo_expire[adir])
return 0;
- ad->last_check_fifo[adir] = jiffies;
+ asq->last_check_fifo[adir] = jiffies;
- if (list_empty(&ad->fifo_list[adir]))
+ if (list_empty(&asq->fifo_list[adir]))
return 0;
- rq = rq_entry_fifo(ad->fifo_list[adir].next);
+ rq = rq_entry_fifo(asq->fifo_list[adir].next);
return time_after(jiffies, rq_fifo_time(rq));
}
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
* as_batch_expired returns true if the current batch has expired. A batch
* is a set of reads or a set of writes.
*/
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
{
if (ad->changed_batch || ad->new_batch)
return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
return time_after(jiffies, ad->current_batch_expires);
return time_after(jiffies, ad->current_batch_expires)
- || ad->current_write_count == 0;
+ || asq->current_write_count == 0;
}
/*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
ad->io_context = NULL;
}
- if (ad->current_write_count != 0)
- ad->current_write_count--;
+ if (asq->current_write_count != 0)
+ asq->current_write_count--;
}
ad->ioc_finished = 0;
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
/*
* take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
static int as_dispatch_request(struct request_queue *q, int force)
{
struct as_data *ad = q->elevator->elevator_data;
- const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
- const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
struct request *rq;
+ struct as_queue *asq = elv_select_sched_queue(q, force);
+ int reads, writes;
+
+ if (!asq)
+ return 0;
+
+ reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+ writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
if (unlikely(force)) {
/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 0;
ad->new_batch = 0;
- while (ad->next_rq[BLK_RW_SYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+ while (asq->next_rq[BLK_RW_SYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
- while (ad->next_rq[BLK_RW_ASYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+ while (asq->next_rq[BLK_RW_ASYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
return dispatched;
}
/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
- if (ad->current_write_count == 0 || !writes)
- ad->write_batch_idled = 1;
+ if (asq->current_write_count == 0 || !writes)
+ asq->write_batch_idled = 1;
}
if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
|| ad->changed_batch)
return 0;
- if (!(reads && writes && as_batch_expired(ad))) {
+ if (!(reads && writes && as_batch_expired(ad, asq))) {
/*
* batch is still running or no reads or no writes
*/
- rq = ad->next_rq[ad->batch_data_dir];
+ rq = asq->next_rq[ad->batch_data_dir];
if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
- if (as_fifo_expired(ad, BLK_RW_SYNC))
+ if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
goto fifo_expired;
if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
if (writes && ad->batch_data_dir == BLK_RW_SYNC)
/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 1;
}
ad->batch_data_dir = BLK_RW_SYNC;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
- ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+ asq->last_check_fifo[ad->batch_data_dir] = jiffies;
goto dispatch_request;
}
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
if (ad->batch_data_dir == BLK_RW_SYNC) {
ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
ad->new_batch = 0;
}
ad->batch_data_dir = BLK_RW_ASYNC;
- ad->current_write_count = ad->write_batch_count;
- ad->write_batch_idled = 0;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->current_write_count = asq->write_batch_count;
+ asq->write_batch_idled = 0;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
goto dispatch_request;
}
@@ -1150,9 +1162,9 @@ dispatch_request:
* If a request has expired, service it.
*/
- if (as_fifo_expired(ad, ad->batch_data_dir)) {
+ if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
fifo_expired:
- rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+ rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
}
if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
int data_dir;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
RQ_SET_STATE(rq, AS_RQ_NEW);
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
as_update_rq(ad, rq); /* keep state machine up to date */
RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
}
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
- struct as_data *ad = q->elevator->elevator_data;
-
- return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
- && list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
static int
as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
- struct as_data *ad = q->elevator->elevator_data;
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
+ struct as_queue *asq = elv_get_sched_queue_current(q);
+
+ if (!asq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
*/
- __rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+ __rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
if (__rq && elv_rq_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
return ret;
}
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
+{
+ struct as_queue *asq;
+ struct as_data *ad = eq->elevator_data;
+
+ asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+ if (asq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+ asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+ asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+ if (ad)
+ asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+ else
+ asq->write_batch_count = default_write_batch_expire / 10;
+
+ if (asq->write_batch_count < 2)
+ asq->write_batch_count = 2;
+out:
+ return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+ struct as_queue *asq = sched_queue;
+
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+ kfree(asq);
+}
+
static void as_exit_queue(struct elevator_queue *e)
{
struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
del_timer_sync(&ad->antic_timer);
cancel_work_sync(&ad->antic_work);
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
put_io_context(ad->io_context);
kfree(ad);
}
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
init_timer(&ad->antic_timer);
INIT_WORK(&ad->antic_work, as_work_handler);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
- ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
- ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
- ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
- if (ad->write_batch_count < 2)
- ad->write_batch_count = 2;
return ad;
}
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
.elevator_add_req_fn = as_add_request,
.elevator_activate_req_fn = as_activate_request,
.elevator_deactivate_req_fn = as_deactivate_request,
- .elevator_queue_empty_fn = as_queue_empty,
.elevator_completed_req_fn = as_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
.elevator_init_fn = as_init_queue,
.elevator_exit_fn = as_exit_queue,
.trim = as_trim,
+ .elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+ .elevator_free_sched_queue_fn = as_free_as_queue,
},
.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
-struct deadline_data {
- /*
- * run time data
- */
-
+struct deadline_queue {
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
- struct rb_root sort_list[2];
+ struct rb_root sort_list[2];
struct list_head fifo_list[2];
-
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
- sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
+};
+struct deadline_data {
+ struct request_queue *q;
+ sector_t last_sector; /* head position */
/*
* settings that change how the i/o scheduler behaves
*/
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
static inline struct rb_root *
deadline_rb_root(struct deadline_data *dd, struct request *rq)
{
- return &dd->sort_list[rq_data_dir(rq)];
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+ return &dq->sort_list[rq_data_dir(rq)];
}
/*
@@ -87,9 +87,10 @@ static inline void
deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- if (dd->next_rq[data_dir] == rq)
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ if (dq->next_rq[data_dir] == rq)
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
elv_rb_del(deadline_rb_root(dd, rq), rq);
}
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(q, rq);
deadline_add_rq_rb(dd, rq);
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
}
/*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
struct deadline_data *dd = q->elevator->elevator_data;
struct request *__rq;
int ret;
+ struct deadline_queue *dq;
+
+ dq = elv_get_sched_queue_current(q);
+ if (!dq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
if (dd->front_merges) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
- __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ __rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
if (__rq) {
BUG_ON(sector != __rq->sector);
@@ -207,10 +214,11 @@ static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- dd->next_rq[READ] = NULL;
- dd->next_rq[WRITE] = NULL;
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ dq->next_rq[READ] = NULL;
+ dq->next_rq[WRITE] = NULL;
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
dd->last_sector = rq_end_sector(rq);
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
*/
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
{
- struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+ struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
/*
* rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int reads = !list_empty(&dd->fifo_list[READ]);
- const int writes = !list_empty(&dd->fifo_list[WRITE]);
+ struct deadline_queue *dq = elv_select_sched_queue(q, force);
+ int reads, writes;
struct request *rq;
int data_dir;
+ if (!dq)
+ return 0;
+
+ reads = !list_empty(&dq->fifo_list[READ]);
+ writes = !list_empty(&dq->fifo_list[WRITE]);
+
/*
* batches are currently reads XOR writes
*/
- if (dd->next_rq[WRITE])
- rq = dd->next_rq[WRITE];
+ if (dq->next_rq[WRITE])
+ rq = dq->next_rq[WRITE];
else
- rq = dd->next_rq[READ];
+ rq = dq->next_rq[READ];
- if (rq && dd->batching < dd->fifo_batch)
+ if (rq && dq->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
goto dispatch_request;
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
- if (writes && (dd->starved++ >= dd->writes_starved))
+ if (writes && (dq->starved++ >= dd->writes_starved))
goto dispatch_writes;
data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
- dd->starved = 0;
+ dq->starved = 0;
data_dir = WRITE;
@@ -299,48 +313,62 @@ dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
- if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
- rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
- rq = dd->next_rq[data_dir];
+ rq = dq->next_rq[data_dir];
}
- dd->batching = 0;
+ dq->batching = 0;
dispatch_request:
/*
* rq is the selected appropriate request.
*/
- dd->batching++;
+ dq->batching++;
deadline_move_request(dd, rq);
return 1;
}
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct deadline_data *dd = q->elevator->elevator_data;
+ struct deadline_queue *dq;
- return list_empty(&dd->fifo_list[WRITE])
- && list_empty(&dd->fifo_list[READ]);
+ dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+ if (dq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&dq->fifo_list[READ]);
+ INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+ dq->sort_list[READ] = RB_ROOT;
+ dq->sort_list[WRITE] = RB_ROOT;
+out:
+ return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+ void *sched_queue)
+{
+ struct deadline_queue *dq = sched_queue;
+
+ kfree(dq);
}
static void deadline_exit_queue(struct elevator_queue *e)
{
struct deadline_data *dd = e->elevator_data;
- BUG_ON(!list_empty(&dd->fifo_list[READ]));
- BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
kfree(dd);
}
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
if (!dd)
return NULL;
- INIT_LIST_HEAD(&dd->fifo_list[READ]);
- INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
- dd->sort_list[READ] = RB_ROOT;
- dd->sort_list[WRITE] = RB_ROOT;
+ dd->q = q;
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
.elevator_merge_req_fn = deadline_merged_requests,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn = deadline_add_request,
- .elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_init_fn = deadline_init_queue,
.elevator_exit_fn = deadline_exit_queue,
+ .elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+ .elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
-
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
return e;
}
-static void *elevator_init_queue(struct request_queue *q,
- struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+ struct elevator_queue *eq)
{
- return eq->ops->elevator_init_fn(q);
+ void *data = NULL;
+
+ if (eq->ops->elevator_init_fn) {
+ data = eq->ops->elevator_init_fn(q);
+ if (data)
+ return data;
+ else
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* IO scheduler does not instantiate data (noop), it is not an error */
+ return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+ void *sched_queue)
+{
+ /* Not all io schedulers (cfq) store sched_queue */
+ if (!sched_queue)
+ return;
+ eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+ struct elevator_queue *eq)
+{
+ void *sched_queue = NULL;
+
+ if (eq->ops->elevator_alloc_sched_queue_fn) {
+ sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+ GFP_KERNEL);
+ if (!sched_queue)
+ return ERR_PTR(-ENOMEM);
+
+ }
+
+ return sched_queue;
}
static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
- void *data)
+ void *data, void *sched_queue)
{
q->elevator = eq;
eq->elevator_data = data;
+ eq->sched_queue = sched_queue;
}
static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
struct elevator_type *e = NULL;
struct elevator_queue *eq;
int ret = 0;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
INIT_LIST_HEAD(&q->queue_head);
q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
if (!eq)
return -ENOMEM;
- data = elevator_init_queue(q, eq);
- if (!data) {
+ data = elevator_init_data(q, eq);
+
+ if (IS_ERR(data)) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, eq);
+
+ if (IS_ERR(sched_queue)) {
kobject_put(&eq->kobj);
return -ENOMEM;
}
- elevator_attach(q, eq, data);
+ elevator_attach(q, eq, data, sched_queue);
return ret;
}
EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elevator_free_sched_queue(e, e->sched_queue);
elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old_elevator, *e;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
/*
* Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (!e)
return 0;
- data = elevator_init_queue(q, e);
- if (!data) {
+ data = elevator_init_data(q, e);
+
+ if (IS_ERR(data)) {
kobject_put(&e->kobj);
- return 0;
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, e);
+
+ if (IS_ERR(sched_queue)) {
+ kobject_put(&e->kobj);
+ return -ENOMEM;
}
/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
/*
* attach and start new elevator
*/
- elevator_attach(q, e, data);
+ elevator_attach(q, e, data, sched_queue);
spin_unlock_irq(q->queue_lock);
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
}
EXPORT_SYMBOL(elv_rb_latter_request);
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
{
- return ioq_sched_queue(rq_ioq(rq));
+ /*
+ * io scheduler is not using fair queuing. Return sched_queue
+ * pointer stored in elevator_queue. It will be null if io
+ * scheduler never stored anything there to begin with (cfq)
+ */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ /*
+ * IO scheduler is using fair queuing infrastructure. If the io scheduler
+ * has passed a non null rq, retrieve sched_queue pointer from
+ * there. */
+ if (rq)
+ return ioq_sched_queue(rq_ioq(rq));
+
+ return NULL;
}
EXPORT_SYMBOL(elv_get_sched_queue);
/* Select an ioscheduler queue to dispatch request from. */
void *elv_select_sched_queue(struct request_queue *q, int force)
{
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
return ioq_sched_queue(elv_fq_select_ioq(q, force));
}
EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+ return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
#include <linux/module.h>
#include <linux/init.h>
-struct noop_data {
+struct noop_queue {
struct list_head queue;
};
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
static int noop_dispatch(struct request_queue *q, int force)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_select_sched_queue(q, force);
- if (!list_empty(&nd->queue)) {
+ if (!nq)
+ return 0;
+
+ if (!list_empty(&nq->queue)) {
struct request *rq;
- rq = list_entry(nd->queue.next, struct request, queuelist);
+ rq = list_entry(nq->queue.next, struct request, queuelist);
list_del_init(&rq->queuelist);
elv_dispatch_sort(q, rq);
return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
static void noop_add_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
- struct noop_data *nd = q->elevator->elevator_data;
-
- return list_empty(&nd->queue);
+ list_add_tail(&rq->queuelist, &nq->queue);
}
static struct request *
noop_former_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.prev == &nd->queue)
+ if (rq->queuelist.prev == &nq->queue)
return NULL;
return list_entry(rq->queuelist.prev, struct request, queuelist);
}
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
static struct request *
noop_latter_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.next == &nd->queue)
+ if (rq->queuelist.next == &nq->queue)
return NULL;
return list_entry(rq->queuelist.next, struct request, queuelist);
}
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct noop_data *nd;
+ struct noop_queue *nq;
- nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
- if (!nd)
- return NULL;
- INIT_LIST_HEAD(&nd->queue);
- return nd;
+ nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+ if (nq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&nq->queue);
+out:
+ return nq;
}
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
{
- struct noop_data *nd = e->elevator_data;
+ struct noop_queue *nq = sched_queue;
- BUG_ON(!list_empty(&nd->queue));
- kfree(nd);
+ kfree(nq);
}
static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
.elevator_merge_req_fn = noop_merged_requests,
.elevator_dispatch_fn = noop_dispatch,
.elevator_add_req_fn = noop_add_request,
- .elevator_queue_empty_fn = noop_queue_empty,
.elevator_former_req_fn = noop_former_request,
.elevator_latter_req_fn = noop_latter_request,
- .elevator_init_fn = noop_init_queue,
- .elevator_exit_fn = noop_exit_queue,
+ .elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
+ .elevator_free_sched_queue_fn = noop_free_noop_queue,
},
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
@@ -112,6 +114,7 @@ struct elevator_queue
{
struct elevator_ops *ops;
void *elevator_data;
+ void *sched_queue;
struct kobject kobj;
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 10/18] io-controller: Prepare elevator layer for single queue schedulers
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (16 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
` (19 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do the groundwork for
noop, deadline and AS.
noop, deadline and AS don't maintain separate queues for different processes.
There is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup, where requests from all the processes in the cgroup
will be queued.
Generally the io scheduler takes care of creating queues. Because there is
only one queue per group here, the common layer has been modified to take care
of queue creation and some other functionality. This special casing helps keep
the changes to noop, deadline and AS to a minimum.
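To make the division of labour concrete, below is a rough sketch of what a
single-queue io scheduler ends up looking like under this scheme. It is not
part of this series and the example_* names are made up; only the hooks and
helpers (the alloc/free sched_queue callbacks, elv_get_sched_queue(),
elv_select_sched_queue() and the ELV_IOSCHED_* feature flags) are the ones
introduced by these patches. It is essentially a stripped-down noop:

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/module.h>
#include <linux/slab.h>

/* One of these is created per io group by the common layer. */
struct example_queue {
	struct list_head queue;
};

static void *example_alloc_queue(struct request_queue *q,
				 struct elevator_queue *eq, gfp_t gfp_mask)
{
	struct example_queue *exq;

	exq = kmalloc_node(sizeof(*exq), gfp_mask | __GFP_ZERO, q->node);
	if (exq)
		INIT_LIST_HEAD(&exq->queue);
	return exq;
}

static void example_free_queue(struct elevator_queue *e, void *sched_queue)
{
	kfree(sched_queue);
}

static void example_add_request(struct request_queue *q, struct request *rq)
{
	/* common layer hands back the queue of the rq's io group */
	struct example_queue *exq = elv_get_sched_queue(q, rq);

	list_add_tail(&rq->queuelist, &exq->queue);
}

static int example_dispatch(struct request_queue *q, int force)
{
	/* common layer decides which group's queue to service now */
	struct example_queue *exq = elv_select_sched_queue(q, force);
	struct request *rq;

	if (!exq || list_empty(&exq->queue))
		return 0;

	rq = list_entry(exq->queue.next, struct request, queuelist);
	list_del_init(&rq->queuelist);
	elv_dispatch_sort(q, rq);
	return 1;
}

static struct elevator_type example_iosched = {
	.ops = {
		.elevator_dispatch_fn		= example_dispatch,
		.elevator_add_req_fn		= example_add_request,
		.elevator_alloc_sched_queue_fn	= example_alloc_queue,
		.elevator_free_sched_queue_fn	= example_free_queue,
	},
	/*
	 * One ioq per group, fair queuing done by the elevator layer.
	 * In the real patches this is under #ifdef CONFIG_IOSCHED_*_HIER.
	 */
	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
	.elevator_name = "example",
	.elevator_owner = THIS_MODULE,
};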
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 160 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 67 +++++++++++++++++++
block/elevator.c | 35 ++++++++++-
include/linux/elevator.h | 14 ++++
4 files changed, 274 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index ec01273..f2805e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -915,6 +915,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
/* Free up async idle queue */
elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* Optimization for io schedulers having single ioq */
+ if (elv_iosched_single_ioq(e))
+ elv_release_ioq(e, &iog->ioq);
+#endif
}
@@ -1702,6 +1708,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
rq->iog = iog;
}
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a single
+ * io queue per cgroup. In this case the common layer can just maintain a
+ * pointer in the group data structure and keep track of it.
+ *
+ * For the io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask)
+{
+ struct elevator_queue *e = q->elevator;
+ unsigned long flags;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
+ struct io_group *iog;
+ void *sched_q = NULL, *new_sched_q = NULL;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ /* Determine the io group request belongs to */
+ iog = rq->iog;
+ BUG_ON(!iog);
+
+retry:
+ /* Get the iosched queue */
+ ioq = io_group_ioq(iog);
+ if (!ioq) {
+ /* io queue and sched_queue needs to be allocated */
+ BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+ if (new_sched_q) {
+ goto alloc_ioq;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ /* Call io scheduler to create scheduler queue */
+ new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+ e, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+ gfp_mask | __GFP_ZERO);
+ if (!sched_q)
+ goto queue_fail;
+ }
+
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ sched_q = new_sched_q;
+ new_sched_q = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ e->ops->elevator_free_sched_queue_fn(e,
+ sched_q);
+ sched_q = NULL;
+ goto queue_fail;
+ }
+ }
+
+ elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ io_group_set_ioq(iog, ioq);
+ elv_mark_ioq_sync(ioq);
+ }
+
+ if (new_sched_q)
+ e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
+ /* Request reference */
+ elv_get_ioq(ioq);
+ rq->ioq = ioq;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 0;
+
+queue_fail:
+ WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+ elv_schedule_dispatch(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ struct io_group *iog;
+
+ /* Determine the io group and io queue of the bio submitting task */
+ iog = io_lookup_io_group_current(q);
+ if (!iog) {
+ /* Maybe the task belongs to a cgroup for which the io
+ * group has not been set up yet. */
+ return NULL;
+ }
+ return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ if (ioq) {
+ rq->ioq = NULL;
+ elv_put_ioq(ioq);
+ }
+}
+
#else /* GROUP_IOSCHED */
void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
{
@@ -2143,7 +2296,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->efqd = efqd;
elv_ioq_set_ioprio_class(ioq, ioprio_class);
elv_ioq_set_ioprio(ioq, ioprio);
- ioq->pid = current->pid;
+
+ if (elv_iosched_single_ioq(eq))
+ ioq->pid = 0;
+ else
+ ioq->pid = current->pid;
+
ioq->sched_queue = sched_queue;
if (is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 7d3434b..5a15329 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -236,6 +236,9 @@ struct io_group {
/* async_queue and idle_queue are used only for cfq */
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
+
+ /* Single ioq per group, used for noop, deadline, anticipatory */
+ struct io_queue *ioq;
};
/**
@@ -507,6 +510,28 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return iog->entity.weight;
}
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ BUG_ON(!iog);
+ return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+ BUG_ON(!iog);
+ /* io group reference. Will be dropped when group is destroyed. */
+ elv_get_ioq(ioq);
+ iog->ioq = ioq;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -538,6 +563,32 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return 0;
}
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -655,5 +706,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
}
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index f6725f2..e634a2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -211,6 +211,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
{
void *sched_queue = NULL;
+ /*
+ * If fair queuing is enabled, then queue allocation takes place
+ * during the set_request() function when a request actually comes
+ * in.
+ */
+ if (elv_iosched_fair_queuing_enabled(eq))
+ return NULL;
+
if (eq->ops->elevator_alloc_sched_queue_fn) {
sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
GFP_KERNEL);
@@ -965,6 +973,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
elv_fq_set_request_io_group(q, rq);
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e))
+ return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
@@ -976,6 +991,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e)) {
+ elv_fq_unset_request_ioq(q, rq);
+ return;
+ }
+
if (e->ops->elevator_put_req_fn)
e->ops->elevator_put_req_fn(rq);
}
@@ -1347,9 +1371,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
/*
* Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
*/
void *elv_get_sched_queue_current(struct request_queue *q)
{
- return q->elevator->sched_queue;
+ /* Fair queuing is not enabled. There is only one queue. */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ return ioq_sched_queue(elv_lookup_ioq_current(q));
}
EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3729a2f..ee38d08 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -249,17 +249,31 @@ enum {
/* iosched wants to use fq logic of elevator layer */
#define ELV_IOSCHED_NEED_FQ 1
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ 2
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
}
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return 0;
}
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (17 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 10/18] io-controller: Prepare elevator layer for single queue schedulers Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (18 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
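For reference, enabling this amounts to roughly the following config fragment
(a sketch; the last two options are pulled in automatically via the select
statements in the Kconfig hunk below):
CONFIG_CGROUPS=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_NOOP_HIER=y
# selected by IOSCHED_NOOP_HIER:
CONFIG_ELV_FAIR_QUEUING=y
CONFIG_GROUP_IOSCHED=y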
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 11 +++++++++++
block/noop-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
that do their own scheduling and require only minimal assistance from
the kernel.
+config IOSCHED_NOOP_HIER
+ bool "Noop Hierarchical Scheduling support"
+ depends on IOSCHED_NOOP && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in noop. In this mode noop keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_AS
tristate "Anticipatory I/O scheduler"
default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
.elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
.elevator_free_sched_queue_fn = noop_free_noop_queue,
},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
};
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (18 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
` (17 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 11 +++++++++++
block/noop-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
that do their own scheduling and require only minimal assistance from
the kernel.
+config IOSCHED_NOOP_HIER
+ bool "Noop Hierarchical Scheduling support"
+ depends on IOSCHED_NOOP && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in noop. In this mode noop keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_AS
tristate "Anticipatory I/O scheduler"
default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
.elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
.elevator_free_sched_queue_fn = noop_free_noop_queue,
},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
};
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 12/18] io-controller: deadline changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (19 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (16 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to old deadline by deselecting CONFIG_IOSCHED_DEADLINE_HIER.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 11 +++++++++++
block/deadline-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
a disk at any one time, its behaviour is almost identical to the
anticipatory I/O scheduler and so is a good choice.
+config IOSCHED_DEADLINE_HIER
+ bool "Deadline Hierarchical Scheduling support"
+ depends on IOSCHED_DEADLINE && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in deadline. In this mode deadline keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 12/18] io-controller: deadline changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (20 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
` (15 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to old deadline by deselecting CONFIG_IOSCHED_DEADLINE_HIER.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 11 +++++++++++
block/deadline-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
a disk at any one time, its behaviour is almost identical to the
anticipatory I/O scheduler and so is a good choice.
+config IOSCHED_DEADLINE_HIER
+ bool "Deadline Hierarchical Scheduling support"
+ depends on IOSCHED_DEADLINE && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in deadline. In this mode deadline keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 13/18] io-controller: anticipatory changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (21 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (14 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
This patch changes the anticipatory scheduler to use queue scheduling code from
the elevator layer. One can go back to the old AS by deselecting
CONFIG_IOSCHED_AS_HIER.
TODO/Issues
===========
- AS anticipation logic does not seem to be sufficient to provide BW difference
if two "dd" are going in two different cgroups. Needs to be looked into.
- AS write batch length (number of requests) adjustment happens upon every W->R
batch direction switch. This automatic adjustment depends on how much time a
read takes after a W->R switch.
This does not gel very well when hierarchical scheduling is enabled and
every io group can have its own read/write batches. If io group
switching takes place, it creates issues.
Currently I have disabled write batch length adjustment in hierarchical
mode.
- Currently performance seems to be very bad in hierarchical mode. Needs
to be looked into.
- I think the whole idea of the common layer doing time slice switching between
queues, and then the queue in turn running timed batches, is not very good.
Maybe AS can maintain two queues (one for READS and the other for WRITES) and
let the common layer do the time slice switching between these two queues.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 12 +++
block/as-iosched.c | 177 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.c | 76 ++++++++++++++++----
include/linux/elevator.h | 16 ++++
4 files changed, 266 insertions(+), 15 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
deadline I/O scheduler, it can also be slower in some cases
especially some database loads.
+config IOSCHED_AS_HIER
+ bool "Anticipatory Hierarchical Scheduling support"
+ depends on IOSCHED_AS && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in anticipatory. In this mode
+ anticipatory keeps one IO queue per cgroup instead of a global
+ queue. Elevator fair queuing logic ensures fairness among various
+ queues.
+
config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7158e13..12aea88 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -84,6 +84,19 @@ struct as_queue {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+
+ /*
+ * If an as_queue is switched while a batch is running, then we
+ * store the time left before current batch will expire
+ */
+ long current_batch_time_left;
+
+ /*
+ * batch data dir when queue was scheduled out. This will be used
+ * to setup ad->batch_data_dir when queue is scheduled in.
+ */
+ int saved_batch_data_dir;
+
unsigned long last_check_fifo[2];
int write_batch_count; /* max # of reqs in a write batch */
int current_write_count; /* how many requests left this batch */
@@ -150,6 +163,141 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Save batch data dir */
+ asq->saved_batch_data_dir = ad->batch_data_dir;
+
+ if (ad->changed_batch) {
+ /*
+ * In case of force expire, we come here. Batch changeover
+ * has been signalled but we are waiting for all the
+ * requests from the previous batch to finish and then start
+ * the new batch. Can't wait now. Mark that full batch time
+ * needs to be allocated when this queue is scheduled again.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->changed_batch = 0;
+ return;
+ }
+
+ if (ad->new_batch) {
+ /*
+ * We should come here only when new_batch has been set
+ * but no read request has been issued or if it is a forced
+ * expiry.
+ *
+ * In both cases, the new batch has not started yet so
+ * allocate full batch length for next scheduling opportunity.
+ * We don't do write batch size adjustment in hierarchical
+ * AS so that should not be an issue.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->new_batch = 0;
+ return;
+ }
+
+ /* Save how much time is left before current batch expires */
+ if (as_batch_expired(ad, asq))
+ asq->current_batch_time_left = 0;
+ else {
+ asq->current_batch_time_left = ad->current_batch_expires
+ - jiffies;
+ BUG_ON((asq->current_batch_time_left) < 0);
+ }
+}
+
+/*
+ * FIXME: In original AS, read batch time accounting started only after the
+ * first request had completed (if the last batch was a write batch). But here
+ * we might be rescheduling a read batch right away, irrespective of the
+ * disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Adjust the batch expire time */
+ if (asq->current_batch_time_left)
+ ad->current_batch_expires = jiffies +
+ asq->current_batch_time_left;
+ /* restore asq batch_data_dir info */
+ ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
+{
+ struct as_queue *asq = sched_queue;
+ struct as_data *ad = q->elevator->elevator_data;
+
+ as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+ int slice_expired, int force)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+ int status = ad->antic_status;
+ struct as_queue *asq = sched_queue;
+
+ /* Forced expiry. We don't have a choice */
+ if (force) {
+ as_antic_stop(ad);
+ as_save_batch_context(ad, asq);
+ return 1;
+ }
+
+ /*
+ * We are waiting for requests to finish from last
+ * batch. Don't expire the queue now
+ */
+ if (ad->changed_batch)
+ goto keep_queue;
+
+ /*
+ * Wait for all requests from existing batch to finish before we
+ * switch the queue. New queue might change the batch direction
+ * and this is to be consistent with the AS philosophy of not dispatching
+ * new requests to the underlying drive till requests from the
+ * previous batch are completed.
+ */
+ if (ad->nr_dispatched)
+ goto keep_queue;
+
+ /*
+ * If AS anticipation is ON, stop it if slice expired, otherwise
+ * keep the queue.
+ */
+ if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+ if (slice_expired)
+ as_antic_stop(ad);
+ else
+ /*
+ * We are anticipating and the time slice has not expired,
+ * so we would rather keep waiting than break the
+ * anticipation and expire the queue.
+ */
+ goto keep_queue;
+ }
+
+ /* We are good to expire the queue. Save batch context */
+ as_save_batch_context(ad, asq);
+ return 1;
+
+keep_queue:
+ return 0;
+}
+#endif
/*
* IO Context helper functions
@@ -805,6 +953,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
}
}
+#ifndef CONFIG_IOSCHED_AS_HIER
/*
* Gathers timings and resizes the write batch automatically
*/
@@ -833,6 +982,7 @@ static void update_write_batch(struct as_data *ad)
if (asq->write_batch_count < 1)
asq->write_batch_count = 1;
}
+#endif /* !CONFIG_IOSCHED_AS_HIER */
/*
* as_completed_request is to be called when a request has completed and
@@ -867,7 +1017,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
* and writeback caches
*/
if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+ /*
+ * Dynamic update of the write batch length is disabled
+ * for hierarchical scheduling. It is difficult to do
+ * accurate accounting when a queue switch can take place
+ * in the middle of a batch.
+ *
+ * Say, A, B are two groups. Following is the sequence of
+ * events.
+ *
+ * Servicing Write batch of A.
+ * Queue switch takes place and write batch of B starts.
+ * Batch switch takes place and read batch of B starts.
+ *
+ * In the above scenario, writes issued in the write batch of A
+ * might impact the write batch length of B, which is not
+ * desirable.
+ */
update_write_batch(ad);
+#endif
ad->current_batch_expires = jiffies +
ad->batch_expire[BLK_RW_SYNC];
ad->new_batch = 0;
@@ -1516,8 +1685,14 @@ static struct elevator_type iosched_as = {
.trim = as_trim,
.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+ .elevator_expire_ioq_fn = as_expire_ioq,
+ .elevator_active_ioq_set_fn = as_active_ioq_set,
},
-
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+ },
+#endif
.elevator_attrs = as_attrs,
.elevator_name = "anticipatory",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index f2805e6..02c27ac 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,6 +36,8 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -2230,6 +2232,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
int old_idle, enable_idle;
struct elv_fq_data *efqd = ioq->efqd;
+ /* If idling is disabled by the ioscheduler, return */
+ if (!elv_gen_idling_enabled(eq))
+ return;
/*
* Don't idle for async or idle io prio class
*/
@@ -2303,7 +2308,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->pid = current->pid;
ioq->sched_queue = sched_queue;
- if (is_sync && !elv_ioq_class_idle(ioq))
+ if (elv_gen_idling_enabled(eq) && is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
bfq_init_entity(&ioq->entity, iog);
ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -2718,16 +2723,18 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
{
elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 0, 1)) {
+ elv_ioq_slice_expired(q);
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
+ /*
+ * Put the new queue at the front of the of the current list,
+ * so we know that it will be selected next.
+ */
- elv_activate_ioq(ioq, 1);
- elv_ioq_set_slice_end(ioq, 0);
- elv_mark_ioq_slice_new(ioq);
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+ }
}
void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2906,11 +2913,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
elv_deactivate_ioq(efqd, ioq, 0);
}
+/*
+ * Call the iosched to tell it that the elevator wants to expire the queue.
+ * This gives an iosched like AS a chance to say no (e.g. if it is in the
+ * middle of a batch changeover or anticipating) and to do some housekeeping.
+ *
+ * force--> this is a forced dispatch and the iosched must clean up its
+ * state. This is useful when the elevator wants to drain the
+ * iosched and expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice expired, hence the elevator fair
+ * queuing logic wants to switch the queue. The iosched should
+ * allow that unless it really has to keep the queue. Currently
+ * AS can deny the switch if in the middle of a batch switch.
+ *
+ * if 0, the time slice is still remaining. It is up to the iosched
+ * whether it wants to wait on this queue or just wants to
+ * expire it and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (e->ops->elevator_expire_ioq_fn)
+ return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+ slice_expired, force);
+
+ return 1;
+}
+
/* Common layer function to select the next queue to dispatch from */
void *elv_fq_select_ioq(struct request_queue *q, int force)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+ int slice_expired = 1;
if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2984,8 +3024,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto keep_queue;
}
+ slice_expired = 0;
expire:
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, slice_expired, force))
+ elv_ioq_slice_expired(q);
+ else {
+ ioq = NULL;
+ goto keep_queue;
+ }
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
@@ -3146,7 +3192,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
}
if (elv_ioq_class_idle(ioq)) {
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
goto done;
}
@@ -3170,9 +3217,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq))
- elv_ioq_slice_expired(q);
- else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ if (elv_ioq_slice_used(ioq)) {
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
+ } else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
elv_ioq_arm_slice_timer(q, 0);
}
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ee38d08..cbfce0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
struct request*);
typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
#endif
struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
elevator_should_preempt_fn *elevator_should_preempt_fn;
elevator_update_idle_window_fn *elevator_update_idle_window_fn;
elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+ elevator_expire_ioq_fn *elevator_expire_ioq_fn;
#endif
};
@@ -252,6 +254,9 @@ enum {
/* iosched maintains only single ioq per group.*/
#define ELV_IOSCHED_SINGLE_IOQ 2
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE 4
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -262,6 +267,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
}
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -274,6 +285,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return 0;
}
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 13/18] io-controller: anticipatory changes for hierarchical fair queuing
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (22 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
` (13 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS behavior by deselecting
CONFIG_IOSCHED_AS_HIER.
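The core of the change is a new handshake: the common fair queuing layer asks
the io scheduler for permission before expiring the active queue. For
illustration only (not part of the applied diff), here is a small
self-contained userspace model of the decision as_expire_ioq() makes further
down; the structure and names are simplified stand-ins, not the real kernel
API.

#include <stdio.h>

enum antic { ANTIC_OFF, ANTIC_WAIT_REQ, ANTIC_WAIT_NEXT };

struct as_state {
	int changed_batch;		/* batch changeover signalled */
	int nr_dispatched;		/* requests still with the drive */
	enum antic antic_status;
};

/* Return 1 if the active queue may be expired now, 0 to keep it. */
static int may_expire(struct as_state *ad, int slice_expired, int force)
{
	if (force)
		return 1;		/* forced drain: no choice */
	if (ad->changed_batch)
		return 0;		/* old batch still winding down */
	if (ad->nr_dispatched)
		return 0;		/* keep AS' one-batch-at-a-time rule */
	if (ad->antic_status == ANTIC_WAIT_REQ ||
	    ad->antic_status == ANTIC_WAIT_NEXT) {
		if (!slice_expired)
			return 0;	/* still anticipating, slice left */
		ad->antic_status = ANTIC_OFF;	/* slice over: stop anticipating */
	}
	return 1;
}

int main(void)
{
	struct as_state ad = { .nr_dispatched = 2 };

	printf("%d\n", may_expire(&ad, 1, 0));	/* 0: requests in flight */
	ad.nr_dispatched = 0;
	ad.antic_status = ANTIC_WAIT_NEXT;
	printf("%d\n", may_expire(&ad, 0, 0));	/* 0: anticipating, slice left */
	printf("%d\n", may_expire(&ad, 1, 0));	/* 1: slice expired */
	return 0;
}

The real function additionally saves the batch context before agreeing to
expire the queue.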
TODO/Issues
===========
- AS anticipation logic does not seem to be sufficient to provide bandwidth
  differentiation if two "dd" processes are running in two different cgroups.
  Needs to be looked into.
- AS write batch request-count adjustment happens upon every W->R batch
  direction switch. This automatic adjustment depends on how much time a
  read is taking after a W->R switch.
  This does not gel very well when hierarchical scheduling is enabled and
  every io group can have its separate read/write batch. Now if io group
  switching takes place, it creates issues.
  Currently I have disabled write batch length adjustment in hierarchical
  mode (a small model of the per-queue batch bookkeeping this touches is
  sketched after this list).
- Currently performance seems to be very bad in hierarchical mode. Needs
  to be looked into.
- I think the whole idea of the common layer doing time slice switching
  between queues, with each queue in turn running timed batches, is not very
  good. Maybe AS can maintain two queues (one for READs and the other for
  WRITEs) and let the common layer do the time slice switching between these
  two queues.
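The model referenced in the write-batch item above, for illustration only:
how the remaining batch time is parked in the as_queue on a switch-out and
re-armed on a switch-in, as as_save_batch_context()/as_restore_batch_context()
do in the diff below. jiffies is a plain counter here and the
changed_batch/new_batch special cases are left out.

#include <assert.h>
#include <stdio.h>

static unsigned long jiffies;		/* stand-in for the kernel's jiffies */

struct asq_model {
	long current_batch_time_left;	/* saved on switch-out */
};

struct ad_model {
	unsigned long current_batch_expires;
};

static void save_batch(struct ad_model *ad, struct asq_model *asq)
{
	if (jiffies >= ad->current_batch_expires)
		asq->current_batch_time_left = 0;	/* batch already over */
	else
		asq->current_batch_time_left =
			ad->current_batch_expires - jiffies;
}

static void restore_batch(struct ad_model *ad, struct asq_model *asq)
{
	if (asq->current_batch_time_left)
		ad->current_batch_expires =
			jiffies + asq->current_batch_time_left;
}

int main(void)
{
	struct ad_model ad;
	struct asq_model asq;

	jiffies = 1000;
	ad.current_batch_expires = jiffies + 125;	/* full batch length */

	jiffies += 50;			/* 50 ticks of the batch consumed */
	save_batch(&ad, &asq);		/* group's queue switched out */
	assert(asq.current_batch_time_left == 75);

	jiffies += 300;			/* other groups run meanwhile */
	restore_batch(&ad, &asq);	/* queue switched back in */
	printf("batch expires at %lu, i.e. 75 ticks from now\n",
	       ad.current_batch_expires);
	return 0;
}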
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 12 +++
block/as-iosched.c | 177 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.c | 76 ++++++++++++++++----
include/linux/elevator.h | 16 ++++
4 files changed, 266 insertions(+), 15 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
deadline I/O scheduler, it can also be slower in some cases
especially some database loads.
+config IOSCHED_AS_HIER
+ bool "Anticipatory Hierarchical Scheduling support"
+ depends on IOSCHED_AS && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in the anticipatory scheduler. In this
+ mode anticipatory keeps one IO queue per cgroup instead of a global
+ queue. The elevator fair queuing logic ensures fairness among the
+ various queues.
+
config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7158e13..12aea88 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -84,6 +84,19 @@ struct as_queue {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+
+ /*
+ * If an as_queue is switched while a batch is running, then we
+ * store the time left before current batch will expire
+ */
+ long current_batch_time_left;
+
+ /*
+ * batch data dir when queue was scheduled out. This will be used
+ * to setup ad->batch_data_dir when queue is scheduled in.
+ */
+ int saved_batch_data_dir;
+
unsigned long last_check_fifo[2];
int write_batch_count; /* max # of reqs in a write batch */
int current_write_count; /* how many requests left this batch */
@@ -150,6 +163,141 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Save batch data dir */
+ asq->saved_batch_data_dir = ad->batch_data_dir;
+
+ if (ad->changed_batch) {
+ /*
+ * In case of force expire, we come here. Batch changeover
+ * has been signalled but we are waiting for all the
+ * requests from the previous batch to finish and then start
+ * the new batch. Can't wait now. Mark that full batch time
+ * needs to be allocated when this queue is scheduled again.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->changed_batch = 0;
+ return;
+ }
+
+ if (ad->new_batch) {
+ /*
+ * We should come here only when new_batch has been set
+ * but no read request has been issued or if it is a forced
+ * expiry.
+ *
+ * In both the cases, new batch has not started yet so
+ * allocate full batch length for next scheduling opportunity.
+ * We don't do write batch size adjustment in hierarchical
+ * AS so that should not be an issue.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->new_batch = 0;
+ return;
+ }
+
+ /* Save how much time is left before current batch expires */
+ if (as_batch_expired(ad, asq))
+ asq->current_batch_time_left = 0;
+ else {
+ asq->current_batch_time_left = ad->current_batch_expires
+ - jiffies;
+ BUG_ON((asq->current_batch_time_left) < 0);
+ }
+}
+
+/*
+ * FIXME: In original AS, a read batch's time accounting started only after
+ * the first request had completed (if the last batch was a write batch). But
+ * here we might be rescheduling a read batch right away, irrespective of the
+ * disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Adjust the batch expire time */
+ if (asq->current_batch_time_left)
+ ad->current_batch_expires = jiffies +
+ asq->current_batch_time_left;
+ /* restore asq batch_data_dir info */
+ ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
+{
+ struct as_queue *asq = sched_queue;
+ struct as_data *ad = q->elevator->elevator_data;
+
+ as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from the common layer that it wishes to expire
+ * this io queue. AS decides whether the queue can be expired; if yes, it
+ * also saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+ int slice_expired, int force)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+ int status = ad->antic_status;
+ struct as_queue *asq = sched_queue;
+
+ /* Forced expiry. We don't have a choice */
+ if (force) {
+ as_antic_stop(ad);
+ as_save_batch_context(ad, asq);
+ return 1;
+ }
+
+ /*
+ * We are waiting for requests from the last
+ * batch to finish. Don't expire the queue now.
+ */
+ if (ad->changed_batch)
+ goto keep_queue;
+
+ /*
+ * Wait for all requests from existing batch to finish before we
+ * switch the queue. New queue might change the batch direction
+ * and this is to be consistent with the AS philosophy of not dispatching
+ * new requests to the underlying drive till requests from the
+ * previous batch are completed.
+ */
+ if (ad->nr_dispatched)
+ goto keep_queue;
+
+ /*
+ * If AS anticipation is ON, stop it if slice expired, otherwise
+ * keep the queue.
+ */
+ if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+ if (slice_expired)
+ as_antic_stop(ad);
+ else
+ /*
+ * We are anticipating and the time slice has not expired,
+ * so we would rather keep waiting than break the
+ * anticipation and expire the queue.
+ */
+ goto keep_queue;
+ }
+
+ /* We are good to expire the queue. Save batch context */
+ as_save_batch_context(ad, asq);
+ return 1;
+
+keep_queue:
+ return 0;
+}
+#endif
/*
* IO Context helper functions
@@ -805,6 +953,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
}
}
+#ifndef CONFIG_IOSCHED_AS_HIER
/*
* Gathers timings and resizes the write batch automatically
*/
@@ -833,6 +982,7 @@ static void update_write_batch(struct as_data *ad)
if (asq->write_batch_count < 1)
asq->write_batch_count = 1;
}
+#endif /* !CONFIG_IOSCHED_AS_HIER */
/*
* as_completed_request is to be called when a request has completed and
@@ -867,7 +1017,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
* and writeback caches
*/
if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+ /*
+ * Dynamic update of the write batch length is disabled
+ * for hierarchical scheduling. It is difficult to do
+ * accurate accounting when a queue switch can take place
+ * in the middle of a batch.
+ *
+ * Say, A, B are two groups. Following is the sequence of
+ * events.
+ *
+ * Servicing Write batch of A.
+ * Queue switch takes place and write batch of B starts.
+ * Batch switch takes place and read batch of B starts.
+ *
+ * In the above scenario, writes issued in the write batch of A
+ * might impact the write batch length of B, which is not
+ * desirable.
+ */
update_write_batch(ad);
+#endif
ad->current_batch_expires = jiffies +
ad->batch_expire[BLK_RW_SYNC];
ad->new_batch = 0;
@@ -1516,8 +1685,14 @@ static struct elevator_type iosched_as = {
.trim = as_trim,
.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+ .elevator_expire_ioq_fn = as_expire_ioq,
+ .elevator_active_ioq_set_fn = as_active_ioq_set,
},
-
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+ },
+#endif
.elevator_attrs = as_attrs,
.elevator_name = "anticipatory",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index f2805e6..02c27ac 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,6 +36,8 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -2230,6 +2232,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
int old_idle, enable_idle;
struct elv_fq_data *efqd = ioq->efqd;
+ /* If idling is disabled by the ioscheduler, return */
+ if (!elv_gen_idling_enabled(eq))
+ return;
/*
* Don't idle for async or idle io prio class
*/
@@ -2303,7 +2308,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->pid = current->pid;
ioq->sched_queue = sched_queue;
- if (is_sync && !elv_ioq_class_idle(ioq))
+ if (elv_gen_idling_enabled(eq) && is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
bfq_init_entity(&ioq->entity, iog);
ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -2718,16 +2723,18 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
{
elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 0, 1)) {
+ elv_ioq_slice_expired(q);
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
+ /*
+ * Put the new queue at the front of the of the current list,
+ * so we know that it will be selected next.
+ */
- elv_activate_ioq(ioq, 1);
- elv_ioq_set_slice_end(ioq, 0);
- elv_mark_ioq_slice_new(ioq);
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+ }
}
void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2906,11 +2913,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
elv_deactivate_ioq(efqd, ioq, 0);
}
+/*
+ * Call the iosched to tell it that the elevator wants to expire the queue.
+ * This gives an iosched like AS a chance to say no (e.g. if it is in the
+ * middle of a batch changeover or anticipating) and to do some housekeeping.
+ *
+ * force--> this is a forced dispatch and the iosched must clean up its
+ * state. This is useful when the elevator wants to drain the
+ * iosched and expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice expired, hence the elevator fair
+ * queuing logic wants to switch the queue. The iosched should
+ * allow that unless it really has to keep the queue. Currently
+ * AS can deny the switch if in the middle of a batch switch.
+ *
+ * if 0, the time slice is still remaining. It is up to the iosched
+ * whether it wants to wait on this queue or just wants to
+ * expire it and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (e->ops->elevator_expire_ioq_fn)
+ return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+ slice_expired, force);
+
+ return 1;
+}
+
/* Common layer function to select the next queue to dispatch from */
void *elv_fq_select_ioq(struct request_queue *q, int force)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+ int slice_expired = 1;
if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2984,8 +3024,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto keep_queue;
}
+ slice_expired = 0;
expire:
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, slice_expired, force))
+ elv_ioq_slice_expired(q);
+ else {
+ ioq = NULL;
+ goto keep_queue;
+ }
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
@@ -3146,7 +3192,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
}
if (elv_ioq_class_idle(ioq)) {
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
goto done;
}
@@ -3170,9 +3217,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq))
- elv_ioq_slice_expired(q);
- else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ if (elv_ioq_slice_used(ioq)) {
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
+ } else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
elv_ioq_arm_slice_timer(q, 0);
}
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ee38d08..cbfce0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
struct request*);
typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
#endif
struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
elevator_should_preempt_fn *elevator_should_preempt_fn;
elevator_update_idle_window_fn *elevator_update_idle_window_fn;
elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+ elevator_expire_ioq_fn *elevator_expire_ioq_fn;
#endif
};
@@ -252,6 +254,9 @@ enum {
/* iosched maintains only single ioq per group.*/
#define ELV_IOSCHED_SINGLE_IOQ 2
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE 4
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -262,6 +267,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
}
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -274,6 +285,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return 0;
}
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios.
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (23 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (12 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o blkio_cgroup patches from Ryo to track async bios.
o Fernando is also working on another IO tracking mechanism. We are not
  particular about any one IO tracking mechanism; this patchset can make use
  of whichever mechanism makes it upstream. For the time being, we are making
  use of Ryo's posting.
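For illustration only (not part of the diff): the tracking works by packing
the owner's blkio-cgroup ID into the upper bits of page_cgroup->flags, leaving
the lower 16 bits for ordinary flag bits. The small userspace model below
performs the same bit arithmetic as page_cgroup_set_id()/page_cgroup_get_id()
in the patch; names are simplified.

#include <assert.h>
#include <stdio.h>

#define TRACKING_ID_SHIFT	16
#define TRACKING_ID_BITS	(8 * sizeof(unsigned long) - TRACKING_ID_SHIFT)

static void set_id(unsigned long *flags, unsigned long id)
{
	assert(id < (1UL << TRACKING_ID_BITS));
	*flags &= (1UL << TRACKING_ID_SHIFT) - 1;	/* keep the flag bits */
	*flags |= id << TRACKING_ID_SHIFT;		/* store the owner id */
}

static unsigned long get_id(unsigned long flags)
{
	return flags >> TRACKING_ID_SHIFT;
}

int main(void)
{
	unsigned long flags = 0x5;		/* some flag bits already set */

	set_id(&flags, 42);			/* page now owned by cgroup id 42 */
	assert(get_id(flags) == 42);
	assert((flags & 0xffff) == 0x5);	/* flag bits untouched */
	printf("flags=%#lx id=%lu\n", flags, get_id(flags));
	return 0;
}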
Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
---
block/blk-ioc.c | 37 +++---
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 97 +++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 31 ++++-
init/Kconfig | 15 ++
mm/Makefile | 4 +-
mm/biotrack.c | 300 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 17 ++-
mm/swap_state.c | 2 +
19 files changed, 511 insertions(+), 30 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8f0f6cf..ccde40e 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,32 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ioc->cgroup_changed = 0;
+#endif
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
- ret->cgroup_changed = 0;
-#endif
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..79118d4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..60b1a99 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);
while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..741a8b5
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_cgroup_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else /* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 68ea6bd..f214e6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 51664bb..ed52a1f 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..dd7f71c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +73,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;
static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -122,4 +124,27 @@ static inline void swap_cgroup_swapoff(int type)
}
#endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT (16)
+#define PCG_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 1a4686d..ee16d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -616,6 +616,21 @@ config GROUP_IOSCHED
endif # CGROUPS
+config CGROUP_BLKIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a Resource Controller which makes it possible to track the
+ owner of every block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module, such as the dm-ioband device mapper module or
+ the cfq scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
config MM_OWNER
bool
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..76c3436 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..2baf1f0
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined via the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+ .io_context = &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *biog;
+ struct page_cgroup *pc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0); /* 0: default blkio_cgroup id */
+ unlock_page_cgroup(pc);
+ if (!mm)
+ return;
+
+ rcu_read_lock();
+ biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog)) {
+ rcu_read_unlock();
+ return;
+ }
+ /*
+ * css_get(&biog->css) isn't called to increment the reference
+ * count of this blkio_cgroup "biog", so the css_id might become
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default_blkio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so that it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ if (!biog)
+ return ERR_PTR(-ENOMEM);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR_PTR(-ENOMEM);
+ }
+ biog->io_context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ put_io_context(biog->io_context);
+ free_css_id(&blkio_cgroup_subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ id = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+ unsigned long id;
+
+ id = get_blkio_cgroup_id(bio);
+ rcu_read_lock();
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (css)
+ biog = container_of(css, struct blkio_cgroup, css);
+ else
+ biog = &default_blkio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ rcu_read_unlock();
+ return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id: blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css;
+
+ if (blkio_cgroup_disabled())
+ return NULL;
+
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (!css)
+ return NULL;
+ cgrp = css->cgroup;
+ return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+ unsigned long id;
+
+ rcu_read_lock();
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+ .name = "blkio",
+ .create = blkio_cgroup_create,
+ .destroy = blkio_cgroup_destroy,
+ .populate = blkio_cgroup_populate,
+ .subsys_id = blkio_cgroup_subsys_id,
+ .use_id = 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..875380c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
#include <trace/block.h>
#include <asm/tlbflush.h>
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);
if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..cee1438 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..eeefee3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -128,6 +128,12 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..194bda7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2053,6 +2054,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);
@@ -2560,6 +2563,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
set_pte_at(mm, address, page_table, entry);
/* No need to invalidate - it was non-present before */
@@ -2712,6 +2716,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..f0b6d12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..e143d04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/biotrack.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;
@@ -74,7 +75,7 @@ void __init page_cgroup_init(void)
int nid, fail;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for_each_online_node(nid) {
@@ -83,12 +84,12 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
" don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
panic("Out of memory");
}
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+ " if you don't want\n");
}
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
#include <asm/pgtable.h>
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios.
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (24 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
` (11 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o blkio_cgroup patches from Ryo to track async bios.
o Fernando is also working on another IO tracking mechanism. We are not
  particular about any one IO tracking mechanism; this patchset can make use
  of whichever mechanism makes it upstream. For the time being, we are making
  use of Ryo's posting.
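One behavior worth noting: when a bio's stored owner ID is 0 or no longer
resolves to a live cgroup, the IO is attributed to the default blkio group's
io_context. For illustration only, a small userspace model of that fallback,
loosely following get_blkio_cgroup_iocontext() in the diff below; the lookup
table and names are invented for the example.

#include <stdio.h>

struct ioc_model {
	const char *name;
	int refcount;
};

static struct ioc_model default_ioc = { "default", 1 };
static struct ioc_model group_ioc[] = {
	{ NULL, 0 },		/* id 0 is reserved for the default group */
	{ "grpA", 1 },
	{ "grpB", 1 },
};

static struct ioc_model *lookup_ioc(unsigned long id)
{
	struct ioc_model *ioc = &default_ioc;

	/* ids that still resolve to a live group get that group's context */
	if (id && id < sizeof(group_ioc) / sizeof(group_ioc[0]) &&
	    group_ioc[id].name)
		ioc = &group_ioc[id];
	ioc->refcount++;	/* the caller owns a reference */
	return ioc;
}

int main(void)
{
	printf("%s\n", lookup_ioc(2)->name);	/* grpB */
	printf("%s\n", lookup_ioc(7)->name);	/* default: stale id */
	printf("%s\n", lookup_ioc(0)->name);	/* default: untagged page */
	return 0;
}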
Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
---
block/blk-ioc.c | 37 +++---
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 97 +++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 31 ++++-
init/Kconfig | 15 ++
mm/Makefile | 4 +-
mm/biotrack.c | 300 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 17 ++-
mm/swap_state.c | 2 +
19 files changed, 511 insertions(+), 30 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8f0f6cf..ccde40e 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,32 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ioc->cgroup_changed = 0;
+#endif
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
- ret->cgroup_changed = 0;
-#endif
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..79118d4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..60b1a99 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);
while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..741a8b5
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_cgroup_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else /* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 68ea6bd..f214e6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 51664bb..ed52a1f 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..dd7f71c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +73,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;
static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -122,4 +124,27 @@ static inline void swap_cgroup_swapoff(int type)
}
#endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT (16)
+#define PCG_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with the page_cgroup lock held (lock_page_cgroup()) */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with the page_cgroup lock held (lock_page_cgroup()) */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 1a4686d..ee16d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -616,6 +616,21 @@ config GROUP_IOSCHED
endif # CGROUPS
+config CGROUP_BLKIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a resource controller which makes it possible to track
+ the owner of every block I/O request.
+ The information this subsystem provides can be used by any
+ kind of module such as the dm-ioband device-mapper module or
+ the cfq io scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
config MM_OWNER
bool
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..76c3436 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..2baf1f0
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+ .io_context = &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *biog;
+ struct page_cgroup *pc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0); /* 0: default blkio_cgroup id */
+ unlock_page_cgroup(pc);
+ if (!mm)
+ return;
+
+ rcu_read_lock();
+ biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog)) {
+ rcu_read_unlock();
+ return;
+ }
+ /*
+ * css_get(&biog->css) isn't called to increment the reference
+ * count of this blkio_cgroup "biog", so the css_id might become
+ * invalid even while this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default_blkio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so that it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ if (!biog)
+ return ERR_PTR(-ENOMEM);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR_PTR(-ENOMEM);
+ }
+ biog->io_context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ put_io_context(biog->io_context);
+ free_css_id(&blkio_cgroup_subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ id = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+ unsigned long id;
+
+ id = get_blkio_cgroup_id(bio);
+ rcu_read_lock();
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (css)
+ biog = container_of(css, struct blkio_cgroup, css);
+ else
+ biog = &default_blkio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ rcu_read_unlock();
+ return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id: blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css;
+
+ if (blkio_cgroup_disabled())
+ return NULL;
+
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (!css)
+ return NULL;
+ cgrp = css->cgroup;
+ return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+ unsigned long id;
+
+ rcu_read_lock();
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+ .name = "blkio",
+ .create = blkio_cgroup_create,
+ .destroy = blkio_cgroup_destroy,
+ .populate = blkio_cgroup_populate,
+ .subsys_id = blkio_cgroup_subsys_id,
+ .use_id = 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..875380c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
#include <trace/block.h>
#include <asm/tlbflush.h>
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);
if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..cee1438 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..eeefee3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -128,6 +128,12 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..194bda7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2053,6 +2054,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);
@@ -2560,6 +2563,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
set_pte_at(mm, address, page_table, entry);
/* No need to invalidate - it was non-present before */
@@ -2712,6 +2716,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..f0b6d12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..e143d04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/biotrack.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;
@@ -74,7 +75,7 @@ void __init page_cgroup_init(void)
int nid, fail;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for_each_online_node(nid) {
@@ -83,12 +84,12 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
" don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
panic("Out of memory");
}
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+ " if you don't want\n");
}
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
#include <asm/pgtable.h>
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 15/18] io-controller: map async requests to appropriate cgroup
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (25 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (10 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o So far we were assuming that a bio/rq belongs to the task that is submitting
it. That does not hold good in case of async writes. This patch makes use of
the blkio_cgroup patches to attribute async writes to the right group instead
of the task submitting the bio.
o For sync requests, we continue to assume that the io belongs to the task
submitting it. Only in case of async requests do we make use of the io
tracking patches to track the owner cgroup.
o So far cfq always caches the async queue pointer. With async requests now
not necessarily being tied to the submitting task's io context, caching the
pointer will not help for async queues. This patch introduces a new config
option CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains the
old behavior where the async queue pointer is cached in the task context. If
it is set, the async queue pointer is not cached and we take the help of the
bio tracking patches to determine the group a bio belongs to and then map it
to the async queue of that group. A condensed sketch of this bio-to-group
mapping follows below.
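With CONFIG_TRACK_ASYNC_CONTEXT=y, the mapping described above boils down to
the following untested sketch. The real implementation is get_cgroup_from_bio()
in block/elevator-fq.c further down in this patch; the helper name
bio_to_owner_cgroup() is invented for the example, io_subsys_id comes from the
group io scheduling patches earlier in the series, and the caller is assumed
to hold rcu_read_lock().

#include <linux/bio.h>
#include <linux/sched.h>
#include <linux/cgroup.h>
#include <linux/biotrack.h>
#include <linux/elevator.h>

/* Sketch only: which cgroup should a bio be charged to? */
static struct cgroup *bio_to_owner_cgroup(struct bio *bio)
{
        /* barrier requests (and rq allocation without a bio) map to root */
        if (!bio || bio_barrier(bio))
                return NULL;

        /* sync io: charge the cgroup of the submitting task */
        if (elv_bio_sync(bio))
                return task_cgroup(current, io_subsys_id);

        /*
         * async io: the owner was recorded in the page_cgroup when the
         * page was dirtied, so recover the blkio-cgroup id and look it up.
         */
        return blkio_cgroup_lookup(get_blkio_cgroup_id(bio));
}

A NULL return means "charge to the root group", which matches what the patch
does when no io group has been set up for the cgroup yet.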
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 16 +++++
block/as-iosched.c | 2 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 149 ++++++++++++++++++++++++++++++++++++----------
block/deadline-iosched.c | 2 +-
block/elevator-fq.c | 131 ++++++++++++++++++++++++++++++++++-------
block/elevator-fq.h | 34 +++++++++-
block/elevator.c | 13 ++--
include/linux/elevator.h | 19 +++++-
9 files changed, 304 insertions(+), 69 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
+config TRACK_ASYNC_CONTEXT
+ bool "Determine async request context from bio"
+ depends on GROUP_IOSCHED
+ select CGROUP_BLKIO
+ default n
+ ---help---
+ Normally an async request is attributed to the task submitting the
+ request. With group io scheduling, for accurate accounting of
+ async writes, one needs to map the request to the task/cgroup
+ which originated the request and not to the submitter of the request.
+
+ Currently there are generic io tracking patches that provide the
+ facility to map a bio to its original owner. If this option is set,
+ the original owner of an async bio is determined using the io
+ tracking patches; otherwise we continue to attribute the request
+ to the submitting thread.
endmenu
endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 12aea88..afa554a 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1412,7 +1412,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
- struct as_queue *asq = elv_get_sched_queue_current(q);
+ struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
if (!asq)
return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..b19510a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -643,7 +643,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}
static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
@@ -655,7 +656,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;
if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -796,7 +797,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e9dd5b..ea71239 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -161,8 +161,8 @@ CFQ_CFQQ_FNS(coop);
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+ int, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
@@ -172,22 +172,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[!!is_sync];
}
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
- struct cfq_queue *cfqq, int is_sync)
-{
- cic->cfqq[!!is_sync] = cfqq;
-}
-
/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go into. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later we will save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
*/
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
{
- if (bio_data_dir(bio) == READ || bio_sync(bio))
- return 1;
+ struct cfq_queue *cfqq = NULL;
- return 0;
+ cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct io_group *iog;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+ if (!iog) {
+ /*
+ * May be this is first rq/bio and io group has not
+ * been setup yet.
+ */
+ return NULL;
+ }
+ return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
+ return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+ struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * Don't cache async queue pointer as now one io context might
+ * be submitting async io for various different async queues
+ */
+ if (!is_sync)
+ return;
+#endif
+ cic->cfqq[!!is_sync] = cfqq;
}
static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -505,7 +539,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
@@ -587,7 +621,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
/*
* Disallow merge of a sync bio into an async request.
*/
- if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+ if (elv_bio_sync(bio) && !rq_is_sync(rq))
return 0;
/*
@@ -598,7 +632,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return 0;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq == RQ_CFQQ(rq))
return 1;
@@ -1206,14 +1240,29 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
+ struct io_group *iog = io_lookup_io_group_current(q);
struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+ /*
+ * Drop the reference to old queue unconditionally. Don't
+ * worry whether new async prio queue has been allocated
+ * or not.
+ */
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue(cfqq);
+
+ /*
+ * Why to allocate new queue now? Will it not be automatically
+ * allocated whenever another async request from same context
+ * comes? Keeping it for the time being because existing cfq
+ * code allocates the new queue immediately upon prio change
+ */
+ new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
- }
+ if (new_cfqq)
+ cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
}
cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1274,7 +1323,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_IOSCHED_CFQ_HIER */
static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,6 +1335,21 @@ retry:
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+
+ /*
+ * We have not cached async queue pointer as bio tracking
+ * is enabled. Look into group async queue array using ioc
+ * class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
if (!cfqq) {
if (new_cfqq) {
goto alloc_ioq;
@@ -1348,8 +1412,9 @@ alloc_ioq:
cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
- elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
- cfqq->org_ioprio, is_sync);
+ elv_init_ioq(q->elevator, ioq, iog, cfqq,
+ cfqq->org_ioprio_class, cfqq->org_ioprio,
+ is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
@@ -1372,14 +1437,13 @@ out:
}
static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1388,7 +1452,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
}
if (!cfqq) {
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
if (!cfqq)
return NULL;
}
@@ -1396,8 +1460,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
if (!is_sync && !async_cfqq)
io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- /* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to the async queue and every time a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ * This comes from the fact that we might be getting async requests
+ * which belong to a different cgroup altogether than the cgroup
+ * iocontext belongs to. And this thread might be submitting bios
+ * from various cgroups. So every time async queue will be different
+ * based on the cgroup of the bio/rq. Can't cache the async cfqq
+ * pointer in cic.
+ */
+ if (is_sync)
+ elv_get_ioq(cfqq->ioq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
elv_get_ioq(cfqq->ioq);
+#endif
return cfqq;
}
@@ -1811,7 +1897,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
+ gfp_mask);
if (!cfqq)
goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 27b77b9..87a46c2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
int ret;
struct deadline_queue *dq;
- dq = elv_get_sched_queue_current(q);
+ dq = elv_get_sched_queue_bio(q, bio);
if (!dq)
return ELEVATOR_NO_MERGE;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 02c27ac..69eaee4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -11,6 +11,7 @@
#include <linux/blkdev.h>
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -71,6 +72,7 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
void elv_activate_ioq(struct io_queue *ioq, int add_front);
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue);
+struct io_cgroup *get_iocg_from_bio(struct bio *bio);
static int bfq_update_next_active(struct io_sched_data *sd)
{
@@ -945,6 +947,9 @@ void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &io_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
struct io_cgroup, css);
}
@@ -968,6 +973,7 @@ struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
return NULL;
}
+/* Lookup the io group of the current task */
struct io_group *io_lookup_io_group_current(struct request_queue *q)
{
struct io_group *iog;
@@ -1318,32 +1324,99 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
return iog;
}
+/* Map a bio to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_bio(struct bio *bio)
+{
+ unsigned long bio_cgroup_id;
+ struct cgroup *cgroup;
+
+ /* blk_get_request can reach here without passing a bio */
+ if (!bio)
+ return NULL;
+
+ if (bio_barrier(bio)) {
+ /*
+ * Map barrier requests to root group. May be more special
+ * bio cases should come here
+ */
+ return NULL;
+ }
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (elv_bio_sync(bio)) {
+ /* sync io. Determine cgroup from submitting task context. */
+ cgroup = task_cgroup(current, io_subsys_id);
+ return cgroup;
+ }
+
+ /* Async io. Determine cgroup from with cgroup id stored in page */
+ bio_cgroup_id = get_blkio_cgroup_id(bio);
+
+ if (!bio_cgroup_id)
+ return NULL;
+
+ cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+#else
+ cgroup = task_cgroup(current, io_subsys_id);
+#endif
+ return cgroup;
+}
+
+/* Determine the io cgroup of a bio */
+struct io_cgroup *get_iocg_from_bio(struct bio *bio)
+{
+ struct cgroup *cgrp;
+ struct io_cgroup *iocg = NULL;
+
+ cgrp = get_cgroup_from_bio(bio);
+ if (!cgrp)
+ return &io_root_cgroup;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ if (!iocg)
+ return &io_root_cgroup;
+
+ return iocg;
+}
+
/*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group the bio belongs to.
+ * If "create" is set, io group is created if it is not already present.
*/
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
{
struct cgroup *cgroup;
struct io_group *iog;
struct elv_fq_data *efqd = &q->elevator->efqd;
rcu_read_lock();
- cgroup = task_cgroup(current, io_subsys_id);
- iog = io_find_alloc_group(q, cgroup, efqd, create);
- if (!iog) {
+ cgroup = get_cgroup_from_bio(bio);
+ if (!cgroup) {
if (create)
iog = efqd->root_group;
- else
+ else {
/*
* bio merge functions doing lookup don't want to
* map bio to root group by default
*/
iog = NULL;
+ }
+ goto out;
+ }
+
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ iog = NULL;
}
+out:
rcu_read_unlock();
return iog;
}
+EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
@@ -1678,7 +1751,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
return 1;
/* Determine the io group of the bio submitting task */
- iog = io_get_io_group(q, 0);
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
/* May be task belongs to a differet cgroup for which io
* group has not been setup yet. */
@@ -1692,8 +1765,8 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
}
/* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
+ struct bio *bio)
{
struct io_group *iog;
unsigned long flags;
@@ -1702,7 +1775,7 @@ void elv_fq_set_request_io_group(struct request_queue *q,
* io group to which rq belongs. Later we should make use of
* bio cgroup patches to determine the io group */
spin_lock_irqsave(q->queue_lock, flags);
- iog = io_get_io_group(q, 1);
+ iog = io_get_io_group_bio(q, bio, 1);
spin_unlock_irqrestore(q->queue_lock, flags);
BUG_ON(!iog);
@@ -1797,7 +1870,7 @@ alloc_ioq:
}
}
- elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
}
@@ -1822,17 +1895,17 @@ queue_fail:
}
/*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue the bio belongs to. Optimization for single ioq
* per io group io schedulers.
*/
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
struct io_group *iog;
- /* Determine the io group and io queue of the bio submitting task */
- iog = io_lookup_io_group_current(q);
+ /* lookup the io group and io queue of the bio submitting task */
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
- /* May be task belongs to a cgroup for which io group has
+ /* May be bio belongs to a cgroup for which io group has
* not been setup yet. */
return NULL;
}
@@ -1890,6 +1963,13 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
+{
+ return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
@@ -1902,6 +1982,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
return q->elevator->efqd.root_group;
}
+struct io_group *rq_iog(struct request_queue *q, struct request *rq)
+{
+ return q->elevator->efqd.root_group;
+}
+
#endif /* CONFIG_GROUP_IOSCHED*/
/* Elevator fair queuing function */
@@ -2290,11 +2375,10 @@ void elv_free_ioq(struct io_queue *ioq)
EXPORT_SYMBOL(elv_free_ioq);
int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio,
- int is_sync)
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync)
{
struct elv_fq_data *efqd = &eq->efqd;
- struct io_group *iog = io_lookup_io_group_current(efqd->queue);
RB_CLEAR_NODE(&ioq->entity.rb_node);
atomic_set(&ioq->ref, 0);
@@ -3035,6 +3119,10 @@ expire:
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
+ if (ioq)
+ elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+ elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+ elv_ioq_nr_dispatched(ioq));
return ioq;
}
@@ -3166,7 +3254,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return;
- elv_log_ioq(efqd, ioq, "complete");
+ elv_log_ioq(efqd, ioq, "complete drv=%d disp=%d", efqd->rq_in_driver,
+ elv_ioq_nr_dispatched(ioq));
elv_update_hw_tag(efqd);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5a15329..5fc7d48 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -504,7 +504,7 @@ extern int io_group_allow_merge(struct request *rq, struct bio *bio);
extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq);
+ struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
return iog->entity.weight;
@@ -515,6 +515,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
extern void elv_fq_unset_request_ioq(struct request_queue *q,
struct request *rq);
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -532,6 +534,12 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
iog->ioq = ioq;
}
+static inline struct io_group *rq_iog(struct request_queue *q,
+ struct request *rq)
+{
+ return rq->iog;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -553,7 +561,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
*/
static inline void io_disconnect_groups(struct elevator_queue *e) {}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -589,6 +597,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
+
+extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -630,7 +647,8 @@ extern void elv_put_ioq(struct io_queue *ioq);
extern void __elv_ioq_slice_expired(struct request_queue *q,
struct io_queue *ioq);
extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync);
extern void elv_schedule_dispatch(struct request_queue *q);
extern int elv_hw_tag(struct elevator_queue *e);
extern void *elv_active_sched_queue(struct elevator_queue *e);
@@ -643,6 +661,8 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
int ioprio, struct io_queue *ioq);
extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+ struct bio *bio, int create);
extern int elv_nr_busy_ioq(struct elevator_queue *e);
extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -697,7 +717,7 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -722,5 +742,11 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index e634a2f..3b83b2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,11 +967,12 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
- elv_fq_set_request_io_group(q, rq);
+ elv_fq_set_request_io_group(q, rq, bio);
/*
* Optimization for noop, deadline and AS which maintain only single
@@ -1370,19 +1371,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
EXPORT_SYMBOL(elv_select_sched_queue);
/*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group the bio belongs to.
*
* If fair queuing is enabled, determine the io group of task and retrieve
* the ioq pointer from that. This is used by only single queue ioschedulers
* for retrieving the queue associated with the group to decide whether the
* new bio can do a front merge or not.
*/
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
{
/* Fair queuing is not enabled. There is only one queue. */
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return q->elevator->sched_queue;
- return ioq_sched_queue(elv_lookup_ioq_current(q));
+ return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
}
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index cbfce0b..3e70d24 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -150,7 +150,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);
@@ -293,6 +294,20 @@ static inline int elv_gen_idling_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, such as during
+ * merging and during request allocation, where we don't have an rq but only a
+ * bio and need to find out if this bio will be considered sync or async by
+ * the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+ if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+ return 1;
+ return 0;
+}
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 15/18] io-controller: map async requests to appropriate cgroup
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (26 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
` (9 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o So far we were assuming that a bio/rq belongs to the task that is submitting
it. That does not hold good in case of async writes. This patch makes use of
the blkio_cgroup patches to attribute async writes to the right group instead
of the task submitting the bio.
o For sync requests, we continue to assume that the io belongs to the task
submitting it. Only in case of async requests do we make use of the io
tracking patches to track the owner cgroup.
o So far cfq always caches the async queue pointer. With async requests now
not necessarily being tied to the submitting task's io context, caching the
pointer will not help for async queues. This patch introduces a new config
option CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains the
old behavior where the async queue pointer is cached in the task context. If
it is set, the async queue pointer is not cached and we take the help of the
bio tracking patches to determine the group a bio belongs to and then map it
to the async queue of that group.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 16 +++++
block/as-iosched.c | 2 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 149 ++++++++++++++++++++++++++++++++++++----------
block/deadline-iosched.c | 2 +-
block/elevator-fq.c | 131 ++++++++++++++++++++++++++++++++++-------
block/elevator-fq.h | 34 +++++++++-
block/elevator.c | 13 ++--
include/linux/elevator.h | 19 +++++-
9 files changed, 304 insertions(+), 69 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
+config TRACK_ASYNC_CONTEXT
+ bool "Determine async request context from bio"
+ depends on GROUP_IOSCHED
+ select CGROUP_BLKIO
+ default n
+ ---help---
+ Normally an async request is attributed to the task submitting the
+ request. With group io scheduling, for accurate accounting of
+ async writes, one needs to map the request to the task/cgroup
+ which originated the request and not to the submitter of the request.
+
+ Currently there are generic io tracking patches that provide the
+ facility to map a bio to its original owner. If this option is set,
+ the original owner of an async bio is determined using the io
+ tracking patches; otherwise we continue to attribute the request
+ to the submitting thread.
endmenu
endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 12aea88..afa554a 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1412,7 +1412,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
- struct as_queue *asq = elv_get_sched_queue_current(q);
+ struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
if (!asq)
return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..b19510a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -643,7 +643,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}
static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
@@ -655,7 +656,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;
if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -796,7 +797,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e9dd5b..ea71239 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -161,8 +161,8 @@ CFQ_CFQQ_FNS(coop);
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+ int, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
@@ -172,22 +172,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[!!is_sync];
}
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
- struct cfq_queue *cfqq, int is_sync)
-{
- cic->cfqq[!!is_sync] = cfqq;
-}
-
/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go into. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later we will save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
*/
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
{
- if (bio_data_dir(bio) == READ || bio_sync(bio))
- return 1;
+ struct cfq_queue *cfqq = NULL;
- return 0;
+ cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct io_group *iog;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+ if (!iog) {
+ /*
+ * Maybe this is the first rq/bio and the io group has
+ * not been set up yet.
+ */
+ return NULL;
+ }
+ return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
+ return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+ struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * Don't cache async queue pointer as now one io context might
+ * be submitting async io for various different async queues
+ */
+ if (!is_sync)
+ return;
+#endif
+ cic->cfqq[!!is_sync] = cfqq;
}
static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -505,7 +539,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
@@ -587,7 +621,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
/*
* Disallow merge of a sync bio into an async request.
*/
- if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+ if (elv_bio_sync(bio) && !rq_is_sync(rq))
return 0;
/*
@@ -598,7 +632,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return 0;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq == RQ_CFQQ(rq))
return 1;
@@ -1206,14 +1240,29 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
+ struct io_group *iog = io_lookup_io_group_current(q);
struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+ /*
+ * Drop the reference to old queue unconditionally. Don't
+ * worry whether new async prio queue has been allocated
+ * or not.
+ */
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue(cfqq);
+
+ /*
+ * Why allocate a new queue now? Will it not be automatically
+ * allocated whenever another async request from same context
+ * comes? Keeping it for the time being because existing cfq
+ * code allocates the new queue immediately upon prio change
+ */
+ new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
- }
+ if (new_cfqq)
+ cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
}
cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1274,7 +1323,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_IOSCHED_CFQ_HIER */
static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,6 +1335,21 @@ retry:
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+
+ /*
+ * We have not cached async queue pointer as bio tracking
+ * is enabled. Look into group async queue array using ioc
+ * class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
if (!cfqq) {
if (new_cfqq) {
goto alloc_ioq;
@@ -1348,8 +1412,9 @@ alloc_ioq:
cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
- elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
- cfqq->org_ioprio, is_sync);
+ elv_init_ioq(q->elevator, ioq, iog, cfqq,
+ cfqq->org_ioprio_class, cfqq->org_ioprio,
+ is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
@@ -1372,14 +1437,13 @@ out:
}
static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1388,7 +1452,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
}
if (!cfqq) {
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
if (!cfqq)
return NULL;
}
@@ -1396,8 +1460,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
if (!is_sync && !async_cfqq)
io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- /* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to the async queue, and every time a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ * This comes from the fact that we might be getting async requests
+ * which belong to a different cgroup altogether than the cgroup
+ * iocontext belongs to. And this thread might be submitting bios
+ * from various cgroups. So every time async queue will be different
+ * based on the cgroup of the bio/rq. Can't cache the async cfqq
+ * pointer in cic.
+ */
+ if (is_sync)
+ elv_get_ioq(cfqq->ioq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
elv_get_ioq(cfqq->ioq);
+#endif
return cfqq;
}
@@ -1811,7 +1897,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
+ gfp_mask);
if (!cfqq)
goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 27b77b9..87a46c2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
int ret;
struct deadline_queue *dq;
- dq = elv_get_sched_queue_current(q);
+ dq = elv_get_sched_queue_bio(q, bio);
if (!dq)
return ELEVATOR_NO_MERGE;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 02c27ac..69eaee4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -11,6 +11,7 @@
#include <linux/blkdev.h>
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -71,6 +72,7 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
void elv_activate_ioq(struct io_queue *ioq, int add_front);
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue);
+struct io_cgroup *get_iocg_from_bio(struct bio *bio);
static int bfq_update_next_active(struct io_sched_data *sd)
{
@@ -945,6 +947,9 @@ void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &io_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
struct io_cgroup, css);
}
@@ -968,6 +973,7 @@ struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
return NULL;
}
+/* Lookup the io group of the current task */
struct io_group *io_lookup_io_group_current(struct request_queue *q)
{
struct io_group *iog;
@@ -1318,32 +1324,99 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
return iog;
}
+/* Map a bio to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_bio(struct bio *bio)
+{
+ unsigned long bio_cgroup_id;
+ struct cgroup *cgroup;
+
+ /* blk_get_request can reach here without passing a bio */
+ if (!bio)
+ return NULL;
+
+ if (bio_barrier(bio)) {
+ /*
+ * Map barrier requests to the root group. Maybe more special
+ * bio cases should come here.
+ */
+ return NULL;
+ }
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (elv_bio_sync(bio)) {
+ /* sync io. Determine cgroup from submitting task context. */
+ cgroup = task_cgroup(current, io_subsys_id);
+ return cgroup;
+ }
+
+ /* Async io. Determine cgroup from the cgroup id stored in the page */
+ bio_cgroup_id = get_blkio_cgroup_id(bio);
+
+ if (!bio_cgroup_id)
+ return NULL;
+
+ cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+#else
+ cgroup = task_cgroup(current, io_subsys_id);
+#endif
+ return cgroup;
+}
+
+/* Determine the io cgroup of a bio */
+struct io_cgroup *get_iocg_from_bio(struct bio *bio)
+{
+ struct cgroup *cgrp;
+ struct io_cgroup *iocg = NULL;
+
+ cgrp = get_cgroup_from_bio(bio);
+ if (!cgrp)
+ return &io_root_cgroup;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ if (!iocg)
+ return &io_root_cgroup;
+
+ return iocg;
+}
+
/*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group bio belongs to.
+ * If "create" is set, io group is created if it is not already present.
*/
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
{
struct cgroup *cgroup;
struct io_group *iog;
struct elv_fq_data *efqd = &q->elevator->efqd;
rcu_read_lock();
- cgroup = task_cgroup(current, io_subsys_id);
- iog = io_find_alloc_group(q, cgroup, efqd, create);
- if (!iog) {
+ cgroup = get_cgroup_from_bio(bio);
+ if (!cgroup) {
if (create)
iog = efqd->root_group;
- else
+ else {
/*
* bio merge functions doing lookup don't want to
* map bio to root group by default
*/
iog = NULL;
+ }
+ goto out;
+ }
+
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ iog = NULL;
}
+out:
rcu_read_unlock();
return iog;
}
+EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
@@ -1678,7 +1751,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
return 1;
/* Determine the io group of the bio submitting task */
- iog = io_get_io_group(q, 0);
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
/* Maybe the task belongs to a different cgroup for which io
* group has not been setup yet. */
@@ -1692,8 +1765,8 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
}
/* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
+ struct bio *bio)
{
struct io_group *iog;
unsigned long flags;
@@ -1702,7 +1775,7 @@ void elv_fq_set_request_io_group(struct request_queue *q,
* io group to which rq belongs. Later we should make use of
* bio cgroup patches to determine the io group */
spin_lock_irqsave(q->queue_lock, flags);
- iog = io_get_io_group(q, 1);
+ iog = io_get_io_group_bio(q, bio, 1);
spin_unlock_irqrestore(q->queue_lock, flags);
BUG_ON(!iog);
@@ -1797,7 +1870,7 @@ alloc_ioq:
}
}
- elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
}
@@ -1822,17 +1895,17 @@ queue_fail:
}
/*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue the bio belongs to. Optimization for single ioq
* per io group io schedulers.
*/
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
struct io_group *iog;
- /* Determine the io group and io queue of the bio submitting task */
- iog = io_lookup_io_group_current(q);
+ /* Lookup the io group and io queue the bio belongs to */
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
- /* May be task belongs to a cgroup for which io group has
+ /* Maybe the bio belongs to a cgroup for which the io group has
* not been setup yet. */
return NULL;
}
@@ -1890,6 +1963,13 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
+{
+ return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
@@ -1902,6 +1982,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
return q->elevator->efqd.root_group;
}
+struct io_group *rq_iog(struct request_queue *q, struct request *rq)
+{
+ return q->elevator->efqd.root_group;
+}
+
#endif /* CONFIG_GROUP_IOSCHED*/
/* Elevator fair queuing function */
@@ -2290,11 +2375,10 @@ void elv_free_ioq(struct io_queue *ioq)
EXPORT_SYMBOL(elv_free_ioq);
int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio,
- int is_sync)
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync)
{
struct elv_fq_data *efqd = &eq->efqd;
- struct io_group *iog = io_lookup_io_group_current(efqd->queue);
RB_CLEAR_NODE(&ioq->entity.rb_node);
atomic_set(&ioq->ref, 0);
@@ -3035,6 +3119,10 @@ expire:
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
+ if (ioq)
+ elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+ elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+ elv_ioq_nr_dispatched(ioq));
return ioq;
}
@@ -3166,7 +3254,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return;
- elv_log_ioq(efqd, ioq, "complete");
+ elv_log_ioq(efqd, ioq, "complete drv=%d disp=%d", efqd->rq_in_driver,
+ elv_ioq_nr_dispatched(ioq));
elv_update_hw_tag(efqd);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5a15329..5fc7d48 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -504,7 +504,7 @@ extern int io_group_allow_merge(struct request *rq, struct bio *bio);
extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq);
+ struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
return iog->entity.weight;
@@ -515,6 +515,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
extern void elv_fq_unset_request_ioq(struct request_queue *q,
struct request *rq);
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -532,6 +534,12 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
iog->ioq = ioq;
}
+static inline struct io_group *rq_iog(struct request_queue *q,
+ struct request *rq)
+{
+ return rq->iog;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -553,7 +561,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
*/
static inline void io_disconnect_groups(struct elevator_queue *e) {}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -589,6 +597,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
+
+extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -630,7 +647,8 @@ extern void elv_put_ioq(struct io_queue *ioq);
extern void __elv_ioq_slice_expired(struct request_queue *q,
struct io_queue *ioq);
extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync);
extern void elv_schedule_dispatch(struct request_queue *q);
extern int elv_hw_tag(struct elevator_queue *e);
extern void *elv_active_sched_queue(struct elevator_queue *e);
@@ -643,6 +661,8 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
int ioprio, struct io_queue *ioq);
extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+ struct bio *bio, int create);
extern int elv_nr_busy_ioq(struct elevator_queue *e);
extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -697,7 +717,7 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -722,5 +742,11 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index e634a2f..3b83b2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,11 +967,12 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
- elv_fq_set_request_io_group(q, rq);
+ elv_fq_set_request_io_group(q, rq, bio);
/*
* Optimization for noop, deadline and AS which maintain only single
@@ -1370,19 +1371,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
EXPORT_SYMBOL(elv_select_sched_queue);
/*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group the bio belongs to.
*
* If fair queuing is enabled, determine the io group of task and retrieve
* the ioq pointer from that. This is used by only single queue ioschedulers
* for retrieving the queue associated with the group to decide whether the
* new bio can do a front merge or not.
*/
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
{
/* Fair queuing is not enabled. There is only one queue. */
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return q->elevator->sched_queue;
- return ioq_sched_queue(elv_lookup_ioq_current(q));
+ return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
}
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index cbfce0b..3e70d24 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -150,7 +150,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);
@@ -293,6 +294,20 @@ static inline int elv_gen_idling_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions,
+ * which determine whether an rq/bio is sync or not. There are cases, like
+ * during merging and during request allocation, where we don't have an rq
+ * but only a bio and need to find out if this bio will be considered sync
+ * or async by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+ if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+ return 1;
+ return 0;
+}
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
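The attribution logic the above patch wires into the elevator layer can be
summarised in one helper. The sketch below is a simplified restatement of
get_cgroup_from_bio()/elv_bio_sync() from the patch, with RCU, locking and
the !CONFIG_TRACK_ASYNC_CONTEXT fallback omitted; it is illustrative only,
not a drop-in replacement.

/*
 * Simplified sketch: map a bio to the cgroup it should be charged to
 * when CONFIG_TRACK_ASYNC_CONTEXT is enabled. A NULL return means
 * "use the root group".
 */
static struct cgroup *sketch_bio_to_cgroup(struct bio *bio)
{
        unsigned long id;

        /* blk_get_request() callers and barriers go to the root group */
        if (!bio || bio_barrier(bio))
                return NULL;

        /* Sync io: charge the cgroup of the submitting task */
        if (elv_bio_sync(bio))
                return task_cgroup(current, io_subsys_id);

        /*
         * Async io: charge the cgroup recorded in the page by the io
         * tracking (biotrack) patches, i.e. the original owner of the
         * data, not the thread (pdflush etc.) submitting the bio.
         */
        id = get_blkio_cgroup_id(bio);
        return id ? blkio_cgroup_lookup(id) : NULL;
}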
* [PATCH 16/18] io-controller: Per cgroup request descriptor support
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (27 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (8 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o Currently a request queue has a fixed number of request descriptors for
sync and async requests. Once the request descriptors are consumed, new
processes are put to sleep and they effectively become serialized. Because
sync and async queues are separate, async requests don't impact sync ones,
but if one is looking for fairness between async requests, that is not
achievable if request queue descriptors become the bottleneck.
o Make request descriptors per io group so that if there is lots of IO
going on in one cgroup, it does not impact the IO of other groups.
o This is just one relatively simple way of doing things. This patch will
probably change after the feedback. Folks have raised concerns that in a
hierarchical setup, a child's request descriptors should be capped by the
parent's request descriptors. Maybe we need to have per cgroup per device
files in cgroups where one can specify the upper limit of request
descriptors, and whenever a cgroup is created one needs to assign a request
descriptor limit, making sure the total sum of the children's request
descriptors is not more than that of the parent.
I guess something like the memory controller. Anyway, that would be the next
step. For the time being, we have implemented something simpler as follows.
o This patch implements the per cgroup request descriptors. The request pool
per queue is still common, but every group will have its own wait list and
its own count of request descriptors allocated to that group for sync and
async queues. So effectively request_list becomes a per io group property
and not a global request queue feature.
o Currently one can set q->nr_requests to limit the request descriptors
allocated for the queue. Now there is another tunable, q->nr_group_requests,
which controls the request descriptor limit per group. q->nr_requests
supersedes q->nr_group_requests to make sure that if there are lots of
groups present, we don't end up allocating too many request descriptors on
the queue. A simplified sketch of this two level check follows this list.
o Issues: Currently the notion of congestion is per queue. With per group
request descriptors it is possible that the queue is not congested but the
group the bio will go into is congested.
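Put together, the allocation path now applies two limits before handing out
a descriptor. The sketch below restates that check in condensed form (names
follow the patch; congestion, batching and starvation handling are left
out):

/*
 * Simplified sketch of the two level limit get_request() enforces
 * with this patch. "rl" is the request_list of the io group the bio
 * maps to.
 */
static bool sketch_may_allocate(struct request_queue *q,
                                struct request_list *rl, int is_sync)
{
        /* Per queue limit still caps the total across all groups */
        if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
                return false;

        /* Per group limit keeps one busy cgroup from starving others */
        if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
                return false;

        return true;
}

The per group limit can be tuned at run time through the new
nr_group_requests file that sits next to nr_requests in the queue's sysfs
directory, e.g. echo 64 > /sys/block/<dev>/queue/nr_group_requests.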
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-core.c | 216 ++++++++++++++++++++++++++++++++++--------------
block/blk-settings.c | 3 +
block/blk-sysfs.c | 57 ++++++++++---
block/elevator-fq.c | 14 +++
block/elevator-fq.h | 5 +
block/elevator.c | 6 +-
include/linux/blkdev.h | 62 +++++++++++++-
7 files changed, 283 insertions(+), 80 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b19510a..9226cdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
- rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+ struct request_list *rl = blk_get_request_list(q, NULL);
+
+ /*
+ * In case of group scheduling, the request list is inside the associated
+ * group and when that group is instantiated, it takes care of
+ * initializing the request list also.
+ */
+ blk_init_request_list(rl);
+#endif
+ q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);
- if (!rl->rq_pool)
+ if (!q->rq_data.rq_pool)
return -ENOMEM;
return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
return NULL;
}
+ /* init starved waiter wait queue */
+ init_waitqueue_head(&q->rq_data.starved_wait);
+
/*
* if caller didn't supply a lock, they get per-queue locking with
* our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
{
if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
}
static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
if (!rq)
return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
if (priv) {
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
return NULL;
}
rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);
- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}
@@ -719,18 +733,29 @@ static void __freed_request(struct request_queue *q, int sync)
* A request has just been released. Account for it, update the full and
* congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
+ BUG_ON(!rl->count[sync]);
rl->count[sync]--;
+
+ BUG_ON(!q->rq_data.count[sync]);
+ q->rq_data.count[sync]--;
+
if (priv)
- rl->elvpriv--;
+ q->rq_data.elvpriv--;
- __freed_request(q, sync);
+ __freed_request(q, sync, rl);
if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+
+ /* Wake up the starved process on global list, if any */
+ if (unlikely(q->rq_data.starved)) {
+ if (waitqueue_active(&q->rq_data.starved_wait))
+ wake_up(&q->rq_data.starved_wait);
+ q->rq_data.starved--;
+ }
}
/*
@@ -739,10 +764,9 @@ static void freed_request(struct request_queue *q, int sync, int priv)
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;
- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
- /*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
- */
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
+ if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, is_sync);
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for time being.
+ */
+ if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+ blk_set_queue_full(q, is_sync);
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The group's request descriptor list will fill after this
+ * allocation, so set it as full, and mark this process as
+ * "batching". This process will be allowed to complete a batch
+ * of requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else {
+ if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
+ /*
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
+ */
+ goto out;
}
}
- blk_set_queue_congested(q, is_sync);
}
/*
@@ -783,21 +814,43 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+ if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+ goto out;
+
+ /*
+ * Allocation of request is allowed from queue perspective. Now check
+ * from per group request list
+ */
+
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
goto out;
rl->count[is_sync]++;
rl->starved[is_sync] = 0;
+ q->rq_data.count[is_sync]++;
+
priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
- rl->elvpriv++;
+ q->rq_data.elvpriv++;
if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ if (rq) {
+ /*
+ * TODO. Implement group reference counting and take the
+ * reference to the group to make sure the group, and hence the
+ * request list, does not go away till the rq finishes.
+ */
+ rq->rl = rl;
+ }
+#endif
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
@@ -807,7 +860,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
/*
* in the very unlikely event that allocation failed and no
@@ -817,10 +870,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
- rl->starved[is_sync] = 1;
-
- goto out;
+ if (unlikely(rl->count[is_sync] == 0)) {
+ /*
+ * If there is a request pending in other direction
+ * in same io group, then set the starved flag of
+ * the group request list. Otherwise, we need to
+ * make this process sleep in global starved list
+ * to make sure it will not sleep indefinitely.
+ */
+ if (rl->count[is_sync ^ 1] != 0) {
+ rl->starved[is_sync] = 1;
+ goto out;
+ } else {
+ /*
+ * Tell the calling function to put the
+ * task on the global starved list. Not
+ * the best way.
+ */
+ return ERR_PTR(-ENOMEM);
+ }
+ }
}
/*
@@ -848,15 +917,29 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
- while (!rq) {
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+ while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;
- prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
- TASK_UNINTERRUPTIBLE);
+ if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+ /*
+ * Task failed allocation and needs to wait and
+ * try again. There are no requests pending from
+ * the io group hence need to sleep on global
+ * wait queue. Most likely the allocation failed
+ * because of memory issues.
+ */
+
+ q->rq_data.starved++;
+ prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+ &wait, TASK_UNINTERRUPTIBLE);
+ } else {
+ prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+ TASK_UNINTERRUPTIBLE);
+ }
trace_block_sleeprq(q, bio, rw_flags & 1);
@@ -876,7 +959,12 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
spin_lock_irq(q->queue_lock);
finish_wait(&rl->wait[is_sync], &wait);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ /*
+ * After the sleep, check the rl again in case the cgroup the bio
+ * belonged to is gone and the bio is now mapped to the root group.
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
};
return rq;
@@ -885,6 +973,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
BUG_ON(rw != READ && rw != WRITE);
@@ -892,7 +981,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rq = get_request(q, rw, NULL, gfp_mask, rl);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1075,12 +1164,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);
BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 57af728..8733192 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c942ddc..b60b76e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
unsigned long nr;
int ret = queue_var_store(&nr, page, count);
if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_SYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_SYNC);
- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_ASYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_ASYNC);
- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_SYNC);
wake_up(&rl->wait[BLK_RW_SYNC]);
}
- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_ASYNC);
wake_up(&rl->wait[BLK_RW_ASYNC]);
}
spin_unlock_irq(q->queue_lock);
return ret;
}
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}
+#endif
static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -224,6 +247,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -304,6 +335,9 @@ static struct queue_sysfs_entry queue_fairness_entry = {
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -385,12 +419,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;
blk_sync_queue(q);
- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_data.rq_pool)
+ mempool_destroy(q->rq_data.rq_pool);
if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69eaee4..bd98317 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -954,6 +954,16 @@ struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
struct io_cgroup, css);
}
+struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+ struct io_group *iog;
+
+ iog = io_get_io_group_bio(q, bio, 1);
+ BUG_ON(!iog);
+ return &iog->rl;
+}
+
/*
* Search the bfq_group for bfqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu_read_lock().
@@ -1203,6 +1213,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+ blk_init_request_list(&iog->rl);
+
if (leaf == NULL) {
leaf = iog;
prev = leaf;
@@ -1447,6 +1459,8 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ blk_init_request_list(&iog->rl);
+
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5fc7d48..58543ec 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -239,6 +239,9 @@ struct io_group {
/* Single ioq per group, used for noop, deadline, anticipatory */
struct io_queue *ioq;
+
+ /* request list associated with the group */
+ struct request_list rl;
};
/**
@@ -517,6 +520,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
diff --git a/block/elevator.c b/block/elevator.c
index 3b83b2f..44c9fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -668,7 +668,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_data.elvpriv) {
blk_start_queueing(q);
spin_unlock_irq(q->queue_lock);
msleep(10);
@@ -768,8 +768,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
}
if (unplug_it && blk_queue_plugged(q)) {
- int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
- - q->in_flight;
+ int nrq = q->rq_data.count[BLK_RW_SYNC] +
+ q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
if (nrq >= q->unplug_thresh)
__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c209a0..07aca2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
struct sg_io_hdr;
#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 256 /* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ 64 /* Default maximum */
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group being present (the root
+ * group). Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ /* Default maximum */
+#endif
struct request;
typedef void (rq_end_io_fn)(struct request *, int);
struct request_list {
/*
- * count[], starved[], and wait[] are indexed by
+ * count[], starved and wait[] are indexed by
* BLK_RW_SYNC/BLK_RW_ASYNC
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structures keeps track of mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
+ int starved;
+ /*
+ * Global list for starved tasks. A task will be queued here if
+ * it could not allocate request descriptor and the associated
+ * group request list does not have any requests pending.
+ */
+ wait_queue_head_t starved_wait;
};
/*
@@ -253,6 +283,7 @@ struct request {
#ifdef CONFIG_GROUP_IOSCHED
/* io group request belongs to */
struct io_group *iog;
+ struct request_list *rl;
#endif /* GROUP_IOSCHED */
#endif /* ELV_FAIR_QUEUING */
};
@@ -342,6 +373,9 @@ struct request_queue
*/
struct request_list rq;
+ /* Contains request pool and other data like starved data */
+ struct request_data rq_data;
+
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
@@ -404,6 +438,8 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ /* Max # of per io group requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -776,6 +812,28 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return io_group_get_request_list(q, bio);
+#else
+ return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+ struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return rq->rl;
+#else
+ return blk_get_request_list(q, NULL);
+#endif
+}
+
/*
* Temporary export, until SCSI gets fixed up.
*/
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 16/18] io-controller: Per cgroup request descriptor support
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (28 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
` (7 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o Currently a request queue has got fixed number of request descriptors for
sync and async requests. Once the request descriptors are consumed, new
processes are put to sleep and they effectively become serialized. Because
sync and async queues are separate, async requests don't impact sync ones
but if one is looking for fairness between async requests, that is not
achievable if request queue descriptors become bottleneck.
o Make request descriptor's per io group so that if there is lots of IO
going on in one cgroup, it does not impact the IO of other group.
o This is just one relatively simple way of doing things. This patch will
probably change after the feedback. Folks have raised concerns that in
hierchical setup, child's request descriptors should be capped by parent's
request descriptors. May be we need to have per cgroup per device files
in cgroups where one can specify the upper limit of request descriptors
and whenever a cgroup is created one needs to assign request descritor
limit making sure total sum of child's request descriptor is not more than
of parent.
I guess something like memory controller. Anyway, that would be the next
step. For the time being, we have implemented something simpler as follows.
o This patch implements the per cgroup request descriptors. request pool per
queue is still common but every group will have its own wait list and its
own count of request descriptors allocated to that group for sync and async
queues. So effectively request_list becomes per io group property and not a
global request queue feature.
o Currently one can define q->nr_requests to limit request descriptors
allocated for the queue. Now there is another tunable q->nr_group_requests
which controls the requests descriptr limit per group. q->nr_requests
supercedes q->nr_group_requests to make sure if there are lots of groups
present, we don't end up allocating too many request descriptors on the
queue.
o Issues: Currently notion of congestion is per queue. With per group request
descriptor it is possible that queue is not congested but the group bio
will go into is congested.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/blk-core.c | 216 ++++++++++++++++++++++++++++++++++--------------
block/blk-settings.c | 3 +
block/blk-sysfs.c | 57 ++++++++++---
block/elevator-fq.c | 14 +++
block/elevator-fq.h | 5 +
block/elevator.c | 6 +-
include/linux/blkdev.h | 62 +++++++++++++-
7 files changed, 283 insertions(+), 80 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b19510a..9226cdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
- rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+ struct request_list *rl = blk_get_request_list(q, NULL);
+
+ /*
+ * In case of group scheduling, request list is inside the associated
+ * group and when that group is instanciated, it takes care of
+ * initializing the request list also.
+ */
+ blk_init_request_list(rl);
+#endif
+ q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);
- if (!rl->rq_pool)
+ if (!q->rq_data.rq_pool)
return -ENOMEM;
return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
return NULL;
}
+ /* init starved waiter wait queue */
+ init_waitqueue_head(&q->rq_data.starved_wait);
+
/*
* if caller didn't supply a lock, they get per-queue locking with
* our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
{
if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
}
static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
if (!rq)
return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
if (priv) {
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
return NULL;
}
rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);
- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}
@@ -719,18 +733,29 @@ static void __freed_request(struct request_queue *q, int sync)
* A request has just been released. Account for it, update the full and
* congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
+ BUG_ON(!rl->count[sync]);
rl->count[sync]--;
+
+ BUG_ON(!q->rq_data.count[sync]);
+ q->rq_data.count[sync]--;
+
if (priv)
- rl->elvpriv--;
+ q->rq_data.elvpriv--;
- __freed_request(q, sync);
+ __freed_request(q, sync, rl);
if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+
+ /* Wake up the starved process on global list, if any */
+ if (unlikely(q->rq_data.starved)) {
+ if (waitqueue_active(&q->rq_data.starved_wait))
+ wake_up(&q->rq_data.starved_wait);
+ q->rq_data.starved--;
+ }
}
/*
@@ -739,10 +764,9 @@ static void freed_request(struct request_queue *q, int sync, int priv)
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;
- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
- /*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
- */
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
+ if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, is_sync);
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for time being.
+ */
+ if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+ blk_set_queue_full(q, is_sync);
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The queue request descriptor group will fill after this
+ * allocation, so set
+ * it as full, and mark this process as "batching".
+ * This process will be allowed to complete a batch of
+ * requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else {
+ if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
+ /*
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
+ */
+ goto out;
}
}
- blk_set_queue_congested(q, is_sync);
}
/*
@@ -783,21 +814,43 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+ if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+ goto out;
+
+ /*
+ * Allocation of request is allowed from queue perspective. Now check
+ * from per group request list
+ */
+
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
goto out;
rl->count[is_sync]++;
rl->starved[is_sync] = 0;
+ q->rq_data.count[is_sync]++;
+
priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
- rl->elvpriv++;
+ q->rq_data.elvpriv++;
if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ if (rq) {
+ /*
+ * TODO. Implement group reference counting and take the
+ * reference to the group to make sure group hence request
+ * list does not go away till rq finishes.
+ */
+ rq->rl = rl;
+ }
+#endif
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
@@ -807,7 +860,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
/*
* in the very unlikely event that allocation failed and no
@@ -817,10 +870,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
- rl->starved[is_sync] = 1;
-
- goto out;
+ if (unlikely(rl->count[is_sync] == 0)) {
+ /*
+ * If there is a request pending in other direction
+ * in same io group, then set the starved flag of
+ * the group request list. Otherwise, we need to
+ * make this process sleep in global starved list
+ * to make sure it will not sleep indefinitely.
+ */
+ if (rl->count[is_sync ^ 1] != 0) {
+ rl->starved[is_sync] = 1;
+ goto out;
+ } else {
+ /*
+ * It tells the calling function to put the
+ * task on the global starved list. Not the
+ * best way.
+ */
+ return ERR_PTR(-ENOMEM);
+ }
+ }
}
/*
@@ -848,15 +917,29 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
- while (!rq) {
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+ while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;
- prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
- TASK_UNINTERRUPTIBLE);
+ if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+ /*
+ * Task failed allocation and needs to wait and
+ * try again. There are no requests pending from
+ * the io group, hence it needs to sleep on the global
+ * wait queue. Most likely the allocation failed
+ * because of memory issues.
+ */
+
+ q->rq_data.starved++;
+ prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+ &wait, TASK_UNINTERRUPTIBLE);
+ } else {
+ prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+ TASK_UNINTERRUPTIBLE);
+ }
trace_block_sleeprq(q, bio, rw_flags & 1);
@@ -876,7 +959,12 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
spin_lock_irq(q->queue_lock);
finish_wait(&rl->wait[is_sync], &wait);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ /*
+ * After the sleep, check the rl again in case the cgroup the bio
+ * belonged to is gone and the bio is mapped to the root group now.
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
};
return rq;
@@ -885,6 +973,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
BUG_ON(rw != READ && rw != WRITE);
@@ -892,7 +981,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rq = get_request(q, rw, NULL, gfp_mask, rl);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1075,12 +1164,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);
BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 57af728..8733192 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c942ddc..b60b76e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
unsigned long nr;
int ret = queue_var_store(&nr, page, count);
if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_SYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_SYNC);
- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_ASYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_ASYNC);
- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_SYNC);
wake_up(&rl->wait[BLK_RW_SYNC]);
}
- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_ASYNC);
wake_up(&rl->wait[BLK_RW_ASYNC]);
}
spin_unlock_irq(q->queue_lock);
return ret;
}
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}
+#endif
static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -224,6 +247,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -304,6 +335,9 @@ static struct queue_sysfs_entry queue_fairness_entry = {
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -385,12 +419,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;
blk_sync_queue(q);
- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_data.rq_pool)
+ mempool_destroy(q->rq_data.rq_pool);
if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69eaee4..bd98317 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -954,6 +954,16 @@ struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
struct io_cgroup, css);
}
+struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+ struct io_group *iog;
+
+ iog = io_get_io_group_bio(q, bio, 1);
+ BUG_ON(!iog);
+ return &iog->rl;
+}
+
/*
* Search the bfq_group for bfqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu_read_lock().
@@ -1203,6 +1213,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+ blk_init_request_list(&iog->rl);
+
if (leaf == NULL) {
leaf = iog;
prev = leaf;
@@ -1447,6 +1459,8 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ blk_init_request_list(&iog->rl);
+
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5fc7d48..58543ec 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -239,6 +239,9 @@ struct io_group {
/* Single ioq per group, used for noop, deadline, anticipatory */
struct io_queue *ioq;
+
+ /* request list associated with the group */
+ struct request_list rl;
};
/**
@@ -517,6 +520,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
diff --git a/block/elevator.c b/block/elevator.c
index 3b83b2f..44c9fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -668,7 +668,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_data.elvpriv) {
blk_start_queueing(q);
spin_unlock_irq(q->queue_lock);
msleep(10);
@@ -768,8 +768,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
}
if (unplug_it && blk_queue_plugged(q)) {
- int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
- - q->in_flight;
+ int nrq = q->rq_data.count[BLK_RW_SYNC] +
+ q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
if (nrq >= q->unplug_thresh)
__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c209a0..07aca2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
struct sg_io_hdr;
#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 256 /* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ 64 /* Default maximum */
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (the root
+ * group). Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ /* Default maximum */
+#endif
struct request;
typedef void (rq_end_io_fn)(struct request *, int);
struct request_list {
/*
- * count[], starved[], and wait[] are indexed by
+ * count[], starved and wait[] are indexed by
* BLK_RW_SYNC/BLK_RW_ASYNC
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
+ int starved;
+ /*
+ * Global list for starved tasks. A task will be queued here if
+ * it could not allocate request descriptor and the associated
+ * group request list does not have any requests pending.
+ */
+ wait_queue_head_t starved_wait;
};
/*
@@ -253,6 +283,7 @@ struct request {
#ifdef CONFIG_GROUP_IOSCHED
/* io group request belongs to */
struct io_group *iog;
+ struct request_list *rl;
#endif /* GROUP_IOSCHED */
#endif /* ELV_FAIR_QUEUING */
};
@@ -342,6 +373,9 @@ struct request_queue
*/
struct request_list rq;
+ /* Contains request pool and other data like starved data */
+ struct request_data rq_data;
+
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
@@ -404,6 +438,8 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ /* Max # of per io group requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -776,6 +812,28 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return io_group_get_request_list(q, bio);
+#else
+ return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+ struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return rq->rl;
+#else
+ return blk_get_request_list(q, NULL);
+#endif
+}
+
/*
* Temporary export, until SCSI gets fixed up.
*/
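To make the two-level accounting concrete: queue-wide state now lives in
q->rq_data and is bounded by nr_requests, while each io group gets its own
request_list bounded by the new nr_group_requests tunable (exported next to
nr_requests in /sys/block/<dev>/queue). Below is a minimal user-space model
of the checks get_request() performs above -- the field names mirror the
patch, but this is only an illustrative sketch, not the kernel code.

#include <stdbool.h>
#include <stdio.h>

struct model_request_list { int count[2]; };	/* per io group (sync/async) */
struct model_request_data { int count[2]; };	/* per queue (sync/async) */

struct model_queue {
	unsigned long nr_requests;		/* queue-wide limit */
	unsigned long nr_group_requests;	/* per io group limit */
	struct model_request_data rq_data;
};

/* Mirrors the order of checks in get_request(): queue-wide cap first,
 * then the per-group cap; a successful allocation charges both. */
static bool model_can_allocate(struct model_queue *q,
			       struct model_request_list *rl, int is_sync)
{
	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
		return false;			/* queue-wide hard cap */
	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
		return false;			/* per-group hard cap */
	rl->count[is_sync]++;			/* charge the group ... */
	q->rq_data.count[is_sync]++;		/* ... and the queue */
	return true;
}

int main(void)
{
	struct model_queue q = { .nr_requests = 256, .nr_group_requests = 64 };
	struct model_request_list grp = { { 0, 0 } };
	int granted = 0;

	while (model_can_allocate(&q, &grp, 1))	/* one group hammering sync IO */
		granted++;
	printf("group got %d of %lu queue-wide descriptors\n",
	       granted, q.nr_requests);
	return 0;
}

With the new defaults (BLKDEV_MAX_RQ 256, BLKDEV_MAX_GROUP_RQ 64) a single
group tops out at 3 * 64 / 2 = 96 descriptors, so one busy group can no
longer exhaust the queue-wide pool for everybody else.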
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 17/18] io-controller: IO group refcounting support
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (29 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (6 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o In the original BFQ patch, once a cgroup is deleted it cleans up the
associated io groups immediately, and if there are any active io queues
in that group, these are moved to the root group. This movement of
queues is not good from a fairness perspective, as one can then create
a cgroup, dump lots of IO, delete the cgroup and potentially get a
higher share. Apart from that there are more issues, hence it was felt
that we also need an io group refcounting mechanism so that io groups
can be reclaimed asynchronously (a simplified sketch of the scheme
follows below).
o This is a crude patch to implement io group refcounting. This is still
work in progress and Nauman and Divyesh are playing with more ideas.
o I can do basic cgroup creation, deletion and task movement operations
and there are no crashes (as were reported with V1 by Gui). Though I
have not verified that io groups are actually being freed. Will do that
next.
o There are a couple of hard-to-hit race conditions I am aware of. Will
fix them in upcoming versions (RCU lookup when a group might be going
away during cgroup deletion).
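The scheme, roughly: every holder of an io group -- the owning io_cgroup,
the elevator's group list, each io queue in the group and each child group
(a child pins its parent) -- takes a reference, and the group is freed only
when the last reference is dropped; dropping the last reference also drops
the reference the group held on its parent. A simplified user-space sketch
of that get/put pattern follows (the patch itself uses atomic_t with
elv_get_iog()/elv_put_iog() and frees through io_group_cleanup()):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct sketch_iog {
	atomic_int ref;			/* held by io_cgroup, elevator, ioqs, children */
	struct sketch_iog *parent;	/* a child group pins its parent */
};

static struct sketch_iog *sketch_alloc_iog(struct sketch_iog *parent)
{
	struct sketch_iog *iog = calloc(1, sizeof(*iog));

	atomic_init(&iog->ref, 0);
	iog->parent = parent;
	if (parent)
		atomic_fetch_add(&parent->ref, 1);	/* child's ref on parent */
	return iog;
}

static void sketch_get_iog(struct sketch_iog *iog)
{
	atomic_fetch_add(&iog->ref, 1);
}

static void sketch_put_iog(struct sketch_iog *iog)
{
	struct sketch_iog *parent = iog->parent;

	if (atomic_fetch_sub(&iog->ref, 1) != 1)
		return;				/* somebody still holds a reference */

	printf("freeing group %p\n", (void *)iog);
	free(iog);				/* io_group_cleanup() in the patch */
	if (parent)
		sketch_put_iog(parent);		/* drop the reference the child held */
}

int main(void)
{
	struct sketch_iog *root = sketch_alloc_iog(NULL);
	struct sketch_iog *child = sketch_alloc_iog(root);	/* root->ref == 1 */

	sketch_get_iog(root);		/* e.g. the elevator's reference */
	sketch_get_iog(child);		/* e.g. the io_cgroup's reference */

	sketch_put_iog(child);		/* cgroup deletion: frees child, drops a root ref */
	sketch_put_iog(root);		/* elevator exit: frees root */
	return 0;
}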
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/cfq-iosched.c | 16 ++-
block/elevator-fq.c | 441 ++++++++++++++++++++++++++++++++++-----------------
block/elevator-fq.h | 26 ++--
3 files changed, 320 insertions(+), 163 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea71239..cf9d258 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,8 +1308,17 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
if (sync_cfqq != NULL) {
__iog = cfqq_to_io_group(sync_cfqq);
- if (iog != __iog)
- io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ /*
+ * Drop reference to the sync queue. A new sync queue will
+ * be assigned in the new group upon arrival of a fresh request.
+ * If the old queue has requests, those requests will be
+ * dispatched over a period of time and the queue will be freed
+ * automatically.
+ */
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
}
spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1422,6 +1431,9 @@ alloc_ioq:
elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
+
+ /* ioq reference on iog */
+ elv_get_iog(iog);
cfq_log_cfqq(cfqd, cfqq, "alloced");
}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bd98317..1dd0bb3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,7 +36,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
-void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+void elv_release_ioq(struct io_queue **ioq_ptr);
int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
int force);
@@ -108,6 +108,16 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
{
BUG_ON(sd->next_active != entity);
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ struct io_group *iog = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data != NULL)
+ iog = container_of(entity, struct io_group, entity);
+ return iog;
+}
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -124,6 +134,11 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
struct io_entity *entity)
{
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ return NULL;
+}
#endif
/*
@@ -224,7 +239,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
struct io_entity *entity)
{
struct rb_node *next;
- struct io_queue *ioq = io_entity_to_ioq(entity);
BUG_ON(entity->tree != &st->idle);
@@ -239,10 +253,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
}
bfq_extract(&st->idle, entity);
-
- /* Delete queue from idle list */
- if (ioq)
- list_del(&ioq->queue_list);
}
/**
@@ -374,9 +384,12 @@ static void bfq_active_insert(struct io_service_tree *st,
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
+ struct io_group *iog = io_entity_to_iog(entity);
if (ioq)
elv_get_ioq(ioq);
+ else
+ elv_get_iog(iog);
}
/**
@@ -436,7 +449,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
{
struct io_entity *first_idle = st->first_idle;
struct io_entity *last_idle = st->last_idle;
- struct io_queue *ioq = io_entity_to_ioq(entity);
if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
st->first_idle = entity;
@@ -444,10 +456,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
st->last_idle = entity;
bfq_insert(&st->idle, entity);
-
- /* Add this queue to idle list */
- if (ioq)
- list_add(&ioq->queue_list, &ioq->efqd->idle_list);
}
/**
@@ -463,14 +471,21 @@ static void bfq_forget_entity(struct io_service_tree *st,
struct io_entity *entity)
{
struct io_queue *ioq = NULL;
+ struct io_group *iog = NULL;
BUG_ON(!entity->on_st);
entity->on_st = 0;
st->wsum -= entity->weight;
+
ioq = io_entity_to_ioq(entity);
- if (!ioq)
+ if (ioq) {
+ elv_put_ioq(ioq);
return;
- elv_put_ioq(ioq);
+ }
+
+ iog = io_entity_to_iog(entity);
+ if (iog)
+ elv_put_iog(iog);
}
/**
@@ -909,21 +924,21 @@ void entity_served(struct io_entity *entity, bfq_service_t served,
/*
* Release all the io group references to its async queues.
*/
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+void io_put_io_group_queues(struct io_group *iog)
{
int i, j;
for (i = 0; i < 2; i++)
for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
+ elv_release_ioq(&iog->async_queue[i][j]);
/* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
+ elv_release_ioq(&iog->async_idle_queue);
#ifdef CONFIG_GROUP_IOSCHED
/* Optimization for io schedulers having single ioq */
- if (elv_iosched_single_ioq(e))
- elv_release_ioq(e, &iog->ioq);
+ if (iog->ioq)
+ elv_release_ioq(&iog->ioq);
#endif
}
@@ -1018,6 +1033,9 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
entity = &iog->entity;
entity->parent = parent->my_entity;
entity->sched_data = &parent->sched_data;
+ if (entity->parent)
+ /* Child group reference on parent group */
+ elv_get_iog(parent);
}
/**
@@ -1210,6 +1228,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
if (!iog)
goto cleanup;
+ atomic_set(&iog->ref, 0);
+ iog->deleting = 0;
+
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
@@ -1279,7 +1300,12 @@ void io_group_chain_link(struct request_queue *q, void *key,
rcu_assign_pointer(leaf->key, key);
hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ /* io_cgroup reference on io group */
+ elv_get_iog(leaf);
+
hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+ /* elevator reference on io group */
+ elv_get_iog(leaf);
spin_unlock_irqrestore(&iocg->lock, flags);
@@ -1388,12 +1414,23 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
if (!iocg)
return &io_root_cgroup;
+ /*
+ * If this cgroup's io_cgroup is being deleted, map the bio to
+ * the root cgroup.
+ */
+ if (css_is_removed(&iocg->css))
+ return &io_root_cgroup;
+
return iocg;
}
/*
* Find the io group bio belongs to.
* If "create" is set, io group is created if it is not already present.
+ *
+ * Note: There is a narrow race window where a group is being freed
+ * by the cgroup deletion path and some rq has slipped through into
+ * this group. Fix it.
*/
struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
int create)
@@ -1440,8 +1477,8 @@ void io_free_root_group(struct elevator_queue *e)
spin_lock_irq(&iocg->lock);
hlist_del_rcu(&iog->group_node);
spin_unlock_irq(&iocg->lock);
- io_put_io_group_queues(e, iog);
- kfree(iog);
+ io_put_io_group_queues(iog);
+ elv_put_iog(iog);
}
struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1459,11 +1496,15 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ atomic_set(&iog->ref, 0);
+
blk_init_request_list(&iog->rl);
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
+ /* elevator reference. */
+ elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
spin_unlock_irq(&iocg->lock);
@@ -1560,105 +1601,109 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
}
/*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+ * check whether a given group has got any active entities on any of the
+ * service tree.
*/
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline int io_group_has_active_entities(struct io_group *iog)
{
- int busy, resume;
- struct io_entity *entity = &ioq->entity;
- struct elv_fq_data *efqd = &e->efqd;
- struct io_service_tree *st = io_entity_service_tree(entity);
+ int i;
+ struct io_service_tree *st;
- busy = elv_ioq_busy(ioq);
- resume = !!ioq->nr_queued;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ if (!RB_EMPTY_ROOT(&st->active))
+ return 1;
+ }
- BUG_ON(resume && !entity->on_st);
- BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+ return 0;
+}
+
+/*
+ * Should be called with both iocg->lock as well as queue lock held (if
+ * group is still connected on elevator list)
+ */
+void __iocg_destroy(struct io_cgroup *iocg, struct io_group *iog,
+ int queue_lock_held)
+{
+ int i;
+ struct io_service_tree *st;
/*
- * We could be moving an queue which is on idle tree of previous group
- * What to do? I guess anyway this queue does not have any requests.
- * just forget the entity and free up from idle tree.
- *
- * This needs cleanup. Hackish.
+ * If we are here then we got the queue lock if group was still on
+ * elevator list. If group had already been disconnected from elevator
+ * list, then we don't need the queue lock.
*/
- if (entity->tree == &st->idle) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
- bfq_put_idle_entity(st, entity);
- }
- if (busy) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
-
- if (!resume)
- elv_del_ioq_busy(e, ioq, 0);
- else
- elv_deactivate_ioq(efqd, ioq, 0);
- }
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
/*
- * Here we use a reference to bfqg. We don't need a refcounter
- * as the cgroup reference will not be dropped, so that its
- * destroy() callback will not be invoked.
+ * Mark the io group for deletion so that no new entry goes into the
+ * idle tree. Any active queue will be removed from the active
+ * tree and not put into the idle tree.
*/
- entity->parent = iog->my_entity;
- entity->sched_data = &iog->sched_data;
+ iog->deleting = 1;
- if (busy && resume)
- elv_activate_ioq(ioq, 0);
-}
-EXPORT_SYMBOL(io_ioq_move);
+ /* Flush idle tree. */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
-static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
-{
- struct elevator_queue *eq;
- struct io_entity *entity = iog->my_entity;
- struct io_service_tree *st;
- int i;
+ /*
+ * Drop io group reference on all async queues. This group is
+ * going away so once these queues are empty, free those up
+ * instead of keeping these around in the hope that new IO
+ * will come.
+ *
+ * Note: If this group is disconnected from elevator, elevator
+ * switch must have already done it.
+ */
- eq = container_of(efqd, struct elevator_queue, efqd);
- hlist_del(&iog->elv_data_node);
- __bfq_deactivate_entity(entity, 0);
- io_put_io_group_queues(eq, iog);
+ io_put_io_group_queues(iog);
- for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
- st = iog->sched_data.service_tree + i;
+ if (!io_group_has_active_entities(iog)) {
+ /*
+ * io group does not have any active entities. Because this
+ * group has been decoupled from io_cgroup list and this
+ * cgroup is being deleted, this group should not receive
+ * any new IO. Hence it should be safe to deactivate this
+ * io group and remove from the scheduling tree.
+ */
+ __bfq_deactivate_entity(iog->my_entity, 0);
/*
- * The idle tree may still contain bfq_queues belonging
- * to exited task because they never migrated to a different
- * cgroup from the one being destroyed now. Noone else
- * can access them so it's safe to act without any lock.
+ * Because this io group does not have any active entities,
+ * it should be safe to remove it from elevator list and
+ * drop elevator reference so that upon dropping io_cgroup
+ * reference, this io group should be freed and we don't
+ * wait for elevator switch to happen to free the group
+ * up.
*/
- io_flush_idle_tree(st);
+ if (queue_lock_held) {
+ hlist_del(&iog->elv_data_node);
+ rcu_assign_pointer(iog->key, NULL);
+ /*
+ * Drop iog reference taken by elevator
+ * (efqd->group_list)
+ */
+ elv_put_iog(iog);
+ }
- BUG_ON(!RB_EMPTY_ROOT(&st->active));
- BUG_ON(!RB_EMPTY_ROOT(&st->idle));
}
- BUG_ON(iog->sched_data.next_active != NULL);
- BUG_ON(iog->sched_data.active_entity != NULL);
- BUG_ON(entity->tree != NULL);
+ /* Drop iocg reference on io group */
+ elv_put_iog(iog);
}
-/**
- * bfq_destroy_group - destroy @bfqg.
- * @bgrp: the bfqio_cgroup containing @bfqg.
- * @bfqg: the group being destroyed.
- *
- * Destroy @bfqg, making sure that it is not referenced from its parent.
- */
-static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- struct elv_fq_data *efqd = NULL;
- unsigned long uninitialized_var(flags);
-
- /* Remove io group from cgroup list */
- hlist_del(&iog->group_node);
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+ unsigned long flags;
+ int queue_lock_held = 0;
+ struct elv_fq_data *efqd;
/*
* io groups are linked in two lists. One list is maintained
@@ -1677,58 +1722,93 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
* try to free up async queues again or flush the idle tree.
*/
- rcu_read_lock();
- efqd = rcu_dereference(iog->key);
- if (efqd != NULL) {
- spin_lock_irqsave(efqd->queue->queue_lock, flags);
- if (iog->key == efqd)
- __io_destroy_group(efqd, iog);
- spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
- }
- rcu_read_unlock();
-
- /*
- * No need to defer the kfree() to the end of the RCU grace
- * period: we are called from the destroy() callback of our
- * cgroup, so we can be sure that noone is a) still using
- * this cgroup or b) doing lookups in it.
- */
- kfree(iog);
-}
+retry:
+ spin_lock_irqsave(&iocg->lock, flags);
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node) {
+ /* Take the group queue lock */
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ if (spin_trylock_irq(efqd->queue->queue_lock)) {
+ if (iog->key == efqd) {
+ queue_lock_held = 1;
+ rcu_read_unlock();
+ goto locked;
+ }
-void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
- struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
- struct hlist_node *n, *tmp;
- struct io_group *iog;
+ /*
+ * After acquiring the queue lock, we found
+ * iog->key==NULL, that means elevator switch
+ * completed, group is no longer connected on
+ * elevator hence we can proceed safely without
+ * queue lock.
+ */
+ spin_unlock_irq(efqd->queue->queue_lock);
+ } else {
+ /*
+ * Did not get the queue lock while trying.
+ * Backout. Drop iocg->lock and try again
+ */
+ rcu_read_unlock();
+ spin_unlock_irqrestore(&iocg->lock, flags);
+ udelay(100);
+ goto retry;
- /*
- * Since we are destroying the cgroup, there are no more tasks
- * referencing it, and all the RCU grace periods that may have
- * referenced it are ended (as the destruction of the parent
- * cgroup is RCU-safe); bgrp->group_data will not be accessed by
- * anything else and we don't need any synchronization.
- */
- hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
- io_destroy_group(iocg, iog);
+ }
+ }
+ /*
+ * We come here when iog->key==NULL, that means elevator switch
+ * has already taken place and now this group is no longer
+ * connected to the elevator, so one does not have to hold the
+ * queue lock to do the cleanup.
+ */
+ rcu_read_unlock();
+locked:
+ __iocg_destroy(iocg, iog, queue_lock_held);
+ if (queue_lock_held) {
+ spin_unlock_irq(efqd->queue->queue_lock);
+ queue_lock_held = 0;
+ }
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);
BUG_ON(!hlist_empty(&iocg->group_data));
kfree(iocg);
}
+/* Should be called with queue lock held */
void io_disconnect_groups(struct elevator_queue *e)
{
struct hlist_node *pos, *n;
struct io_group *iog;
struct elv_fq_data *efqd = &e->efqd;
+ int i;
+ struct io_service_tree *st;
hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
elv_data_node) {
- hlist_del(&iog->elv_data_node);
-
+ /*
+ * At this point of time group should be on idle tree. This
+ * would extract the group from idle tree.
+ */
__bfq_deactivate_entity(iog->my_entity, 0);
+ /* Flush all the idle trees of the group */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
+
+ /*
+ * This has to be done here as well, apart from the cgroup cleanup
+ * path, because if the async queue references of the group are not
+ * dropped, the async ioq as well as the associated queue will not be
+ * reclaimed. Apart from that, the async cfqq has to be cleaned up
+ * before the elevator goes away.
+ */
+ io_put_io_group_queues(iog);
+
/*
* Don't remove from the group hash, just set an
* invalid key. No lookups can race with the
@@ -1736,11 +1816,68 @@ void io_disconnect_groups(struct elevator_queue *e)
* implies also that new elements cannot be added
* to the list.
*/
+ hlist_del(&iog->elv_data_node);
rcu_assign_pointer(iog->key, NULL);
- io_put_io_group_queues(e, iog);
+ /* Drop iog reference taken by elevator (efqd->group_list)*/
+ elv_put_iog(iog);
}
}
+/*
+ * This cleanup function does the last bit of work to destroy the io group.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+void io_group_cleanup(struct io_group *iog)
+{
+ struct io_service_tree *st;
+ struct io_entity *entity = iog->my_entity;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ BUG_ON(st->wsum != 0);
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity != NULL && entity->tree != NULL);
+
+ kfree(iog);
+}
+
+/*
+ * Should be called with queue lock held. The only case it can be called
+ * without queue lock held is when elevator has gone away leaving behind
+ * dead io groups which are hanging there to be reclaimed when cgroup is
+ * deleted. In case of cgroup deletion, I think there is only one thread
+ * doing the deletion and the rest of the threads should have been
+ * taken care of by the cgroup code.
+ */
+void elv_put_iog(struct io_group *iog)
+{
+ struct io_group *parent = NULL;
+
+ BUG_ON(!iog);
+
+ BUG_ON(atomic_read(&iog->ref) <= 0);
+ if (!atomic_dec_and_test(&iog->ref))
+ return;
+
+ BUG_ON(iog->entity.on_st);
+
+ if (iog->my_entity)
+ parent = container_of(iog->my_entity->parent,
+ struct io_group, entity);
+ io_group_cleanup(iog);
+
+ if (parent)
+ elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
struct cgroup_subsys io_subsys = {
.name = "io",
.create = iocg_create,
@@ -1887,6 +2024,8 @@ alloc_ioq:
elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
+ /* ioq reference on iog */
+ elv_get_iog(iog);
}
if (new_sched_q)
@@ -1987,7 +2126,7 @@ EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
+ io_put_io_group_queues(iog);
kfree(iog);
}
@@ -2437,13 +2576,11 @@ void elv_put_ioq(struct io_queue *ioq)
}
EXPORT_SYMBOL(elv_put_ioq);
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+void elv_release_ioq(struct io_queue **ioq_ptr)
{
- struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
- io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -2600,9 +2737,19 @@ void elv_activate_ioq(struct io_queue *ioq, int add_front)
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue)
{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
if (ioq == efqd->active_queue)
elv_reset_active_ioq(efqd);
+ /*
+ * The io group ioq belongs to is going away. Don't requeue the
+ * ioq on idle tree. Free it.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+ if (iog->deleting == 1)
+ requeue = 0;
+#endif
bfq_deactivate_entity(&ioq->entity, requeue);
}
@@ -3002,15 +3149,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
}
}
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
- struct io_queue *ioq, *n;
- struct elv_fq_data *efqd = &e->efqd;
-
- list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
- elv_deactivate_ioq(efqd, ioq, 0);
-}
-
/*
* Call iosched to let that elevator wants to expire the queue. This gives
* iosched like AS to say no (if it is in the middle of batch changeover or
@@ -3427,7 +3565,6 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
- INIT_LIST_HEAD(&efqd->idle_list);
INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
@@ -3458,9 +3595,19 @@ void elv_exit_fq_data(struct elevator_queue *e)
elv_shutdown_timer_wq(e);
spin_lock_irq(q->queue_lock);
- /* This should drop all the idle tree references of ioq */
- elv_free_idle_ioq_list(e);
- /* This should drop all the io group references of async queues */
+ /*
+ * This should drop all the references of async queues taken by
+ * io group.
+ *
+ * Also should deactivate the group and extract it from the
+ * idle tree. (group can not be on active tree now after the
+ * elevator has been drained).
+ *
+ * Should flush the idle tree of the group, which in turn will drop the
+ * ioq reference taken by active/idle tree.
+ *
+ * Drop the iog reference taken by elevator.
+ */
io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58543ec..42e3777 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,7 +165,6 @@ struct io_queue {
/* Pointer to generic elevator data structure */
struct elv_fq_data *efqd;
- struct list_head queue_list;
pid_t pid;
/* Number of requests queued on this io queue */
@@ -219,6 +218,7 @@ struct io_queue {
* o All the other fields are protected by the @bfqd queue lock.
*/
struct io_group {
+ atomic_t ref;
struct io_entity entity;
struct hlist_node elv_data_node;
struct hlist_node group_node;
@@ -242,6 +242,9 @@ struct io_group {
/* request list associated with the group */
struct request_list rl;
+
+ /* io group is going away */
+ int deleting;
};
/**
@@ -279,9 +282,6 @@ struct elv_fq_data {
/* List of io groups hanging on this elevator */
struct hlist_head group_list;
- /* List of io queues on idle tree. */
- struct list_head idle_list;
-
struct request_queue *queue;
unsigned int busy_queues;
/*
@@ -504,8 +504,6 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
#ifdef CONFIG_GROUP_IOSCHED
extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
@@ -523,6 +521,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
extern struct request_list *io_group_get_request_list(struct request_queue *q,
struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
{
@@ -545,17 +545,12 @@ static inline struct io_group *rq_iog(struct request_queue *q,
return rq->iog;
}
-#else /* !GROUP_IOSCHED */
-/*
- * No ioq movement is needed in case of flat setup. root io group gets cleaned
- * up upon elevator exit and before that it has been made sure that both
- * active and idle tree are empty.
- */
-static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline void elv_get_iog(struct io_group *iog)
{
+ atomic_inc(&iog->ref);
}
+#else /* !GROUP_IOSCHED */
static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
@@ -608,6 +603,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
return NULL;
}
+static inline void elv_get_iog(struct io_group *iog) { }
+
+static inline void elv_put_iog(struct io_group *iog) { }
extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
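One note on locking in the cgroup deletion path above: iocg_destroy() holds
iocg->lock and then also needs the queue lock of the elevator a group is
still attached to, while other paths take those two locks in the opposite
order. To avoid deadlock it only try-locks the queue lock and, on failure,
backs out completely, waits briefly (udelay(100)) and retries. A minimal
user-space sketch of that pattern, with pthread mutexes standing in for the
kernel spinlocks (illustrative only, not the patch code):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t iocg_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Destroy groups while holding both locks, without ever blocking on
 * queue_lock while iocg_lock is held (which could deadlock against the
 * paths that take the two locks in the opposite order). */
static void destroy_groups(void (*destroy_one)(void))
{
retry:
	pthread_mutex_lock(&iocg_lock);
	if (pthread_mutex_trylock(&queue_lock) != 0) {
		/* Did not get the inner lock: back out and try again. */
		pthread_mutex_unlock(&iocg_lock);
		usleep(100);			/* udelay(100) in the patch */
		goto retry;
	}

	destroy_one();				/* __iocg_destroy() in the patch */

	pthread_mutex_unlock(&queue_lock);
	pthread_mutex_unlock(&iocg_lock);
}

static void dummy_destroy_one(void)
{
}

int main(void)
{
	destroy_groups(dummy_destroy_one);
	return 0;
}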
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 17/18] io-controller: IO group refcounting support
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (30 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
[not found] ` <1241553525-28095-18-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
` (5 subsequent siblings)
37 siblings, 1 reply; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o In the original BFQ patch, once a cgroup is deleted it cleans up the
associated io groups immediately, and if there are any active io queues
in that group, these are moved to the root group. This movement of
queues is not good from a fairness perspective, as one can then create
a cgroup, dump lots of IO, delete the cgroup and potentially get a
higher share. Apart from that there are more issues, hence it was felt
that we also need an io group refcounting mechanism so that io groups
can be reclaimed asynchronously.
o This is a crude patch to implement io group refcounting. This is still
work in progress and Nauman and Divyesh are playing with more ideas.
o I can do basic cgroup creation, deletion and task movement operations
and there are no crashes (as were reported with V1 by Gui). Though I
have not verified that io groups are actually being freed. Will do that
next.
o There are a couple of hard-to-hit race conditions I am aware of. Will
fix them in upcoming versions (RCU lookup when a group might be going
away during cgroup deletion).
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/cfq-iosched.c | 16 ++-
block/elevator-fq.c | 441 ++++++++++++++++++++++++++++++++++-----------------
block/elevator-fq.h | 26 ++--
3 files changed, 320 insertions(+), 163 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea71239..cf9d258 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,8 +1308,17 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
if (sync_cfqq != NULL) {
__iog = cfqq_to_io_group(sync_cfqq);
- if (iog != __iog)
- io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ /*
+ * Drop reference to the sync queue. A new sync queue will
+ * be assigned in the new group upon arrival of a fresh request.
+ * If the old queue has requests, those requests will be
+ * dispatched over a period of time and the queue will be freed
+ * automatically.
+ */
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
}
spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1422,6 +1431,9 @@ alloc_ioq:
elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
+
+ /* ioq reference on iog */
+ elv_get_iog(iog);
cfq_log_cfqq(cfqd, cfqq, "alloced");
}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bd98317..1dd0bb3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,7 +36,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
-void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+void elv_release_ioq(struct io_queue **ioq_ptr);
int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
int force);
@@ -108,6 +108,16 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
{
BUG_ON(sd->next_active != entity);
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ struct io_group *iog = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data != NULL)
+ iog = container_of(entity, struct io_group, entity);
+ return iog;
+}
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -124,6 +134,11 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
struct io_entity *entity)
{
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ return NULL;
+}
#endif
/*
@@ -224,7 +239,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
struct io_entity *entity)
{
struct rb_node *next;
- struct io_queue *ioq = io_entity_to_ioq(entity);
BUG_ON(entity->tree != &st->idle);
@@ -239,10 +253,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
}
bfq_extract(&st->idle, entity);
-
- /* Delete queue from idle list */
- if (ioq)
- list_del(&ioq->queue_list);
}
/**
@@ -374,9 +384,12 @@ static void bfq_active_insert(struct io_service_tree *st,
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
+ struct io_group *iog = io_entity_to_iog(entity);
if (ioq)
elv_get_ioq(ioq);
+ else
+ elv_get_iog(iog);
}
/**
@@ -436,7 +449,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
{
struct io_entity *first_idle = st->first_idle;
struct io_entity *last_idle = st->last_idle;
- struct io_queue *ioq = io_entity_to_ioq(entity);
if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
st->first_idle = entity;
@@ -444,10 +456,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
st->last_idle = entity;
bfq_insert(&st->idle, entity);
-
- /* Add this queue to idle list */
- if (ioq)
- list_add(&ioq->queue_list, &ioq->efqd->idle_list);
}
/**
@@ -463,14 +471,21 @@ static void bfq_forget_entity(struct io_service_tree *st,
struct io_entity *entity)
{
struct io_queue *ioq = NULL;
+ struct io_group *iog = NULL;
BUG_ON(!entity->on_st);
entity->on_st = 0;
st->wsum -= entity->weight;
+
ioq = io_entity_to_ioq(entity);
- if (!ioq)
+ if (ioq) {
+ elv_put_ioq(ioq);
return;
- elv_put_ioq(ioq);
+ }
+
+ iog = io_entity_to_iog(entity);
+ if (iog)
+ elv_put_iog(iog);
}
/**
@@ -909,21 +924,21 @@ void entity_served(struct io_entity *entity, bfq_service_t served,
/*
* Release all the io group references to its async queues.
*/
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+void io_put_io_group_queues(struct io_group *iog)
{
int i, j;
for (i = 0; i < 2; i++)
for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
+ elv_release_ioq(&iog->async_queue[i][j]);
/* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
+ elv_release_ioq(&iog->async_idle_queue);
#ifdef CONFIG_GROUP_IOSCHED
/* Optimization for io schedulers having single ioq */
- if (elv_iosched_single_ioq(e))
- elv_release_ioq(e, &iog->ioq);
+ if (iog->ioq)
+ elv_release_ioq(&iog->ioq);
#endif
}
@@ -1018,6 +1033,9 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
entity = &iog->entity;
entity->parent = parent->my_entity;
entity->sched_data = &parent->sched_data;
+ if (entity->parent)
+ /* Child group reference on parent group */
+ elv_get_iog(parent);
}
/**
@@ -1210,6 +1228,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
if (!iog)
goto cleanup;
+ atomic_set(&iog->ref, 0);
+ iog->deleting = 0;
+
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
@@ -1279,7 +1300,12 @@ void io_group_chain_link(struct request_queue *q, void *key,
rcu_assign_pointer(leaf->key, key);
hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ /* io_cgroup reference on io group */
+ elv_get_iog(leaf);
+
hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+ /* elevator reference on io group */
+ elv_get_iog(leaf);
spin_unlock_irqrestore(&iocg->lock, flags);
@@ -1388,12 +1414,23 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
if (!iocg)
return &io_root_cgroup;
+ /*
+ * If this cgroup's io_cgroup is being deleted, map the bio to
+ * the root cgroup.
+ */
+ if (css_is_removed(&iocg->css))
+ return &io_root_cgroup;
+
return iocg;
}
/*
* Find the io group bio belongs to.
* If "create" is set, io group is created if it is not already present.
+ *
+ * Note: There is a narrow race window where a group is being freed
+ * by the cgroup deletion path and some rq has slipped through into
+ * this group. Fix it.
*/
struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
int create)
@@ -1440,8 +1477,8 @@ void io_free_root_group(struct elevator_queue *e)
spin_lock_irq(&iocg->lock);
hlist_del_rcu(&iog->group_node);
spin_unlock_irq(&iocg->lock);
- io_put_io_group_queues(e, iog);
- kfree(iog);
+ io_put_io_group_queues(iog);
+ elv_put_iog(iog);
}
struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1459,11 +1496,15 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ atomic_set(&iog->ref, 0);
+
blk_init_request_list(&iog->rl);
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
+ /* elevator reference. */
+ elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
spin_unlock_irq(&iocg->lock);
@@ -1560,105 +1601,109 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
}
/*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+ * check whether a given group has got any active entities on any of the
+ * service tree.
*/
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline int io_group_has_active_entities(struct io_group *iog)
{
- int busy, resume;
- struct io_entity *entity = &ioq->entity;
- struct elv_fq_data *efqd = &e->efqd;
- struct io_service_tree *st = io_entity_service_tree(entity);
+ int i;
+ struct io_service_tree *st;
- busy = elv_ioq_busy(ioq);
- resume = !!ioq->nr_queued;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ if (!RB_EMPTY_ROOT(&st->active))
+ return 1;
+ }
- BUG_ON(resume && !entity->on_st);
- BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+ return 0;
+}
+
+/*
+ * Should be called with both iocg->lock as well as queue lock held (if
+ * group is still connected on elevator list)
+ */
+void __iocg_destroy(struct io_cgroup *iocg, struct io_group *iog,
+ int queue_lock_held)
+{
+ int i;
+ struct io_service_tree *st;
/*
- * We could be moving an queue which is on idle tree of previous group
- * What to do? I guess anyway this queue does not have any requests.
- * just forget the entity and free up from idle tree.
- *
- * This needs cleanup. Hackish.
+ * If we are here then we got the queue lock if group was still on
+ * elevator list. If group had already been disconnected from elevator
+ * list, then we don't need the queue lock.
*/
- if (entity->tree == &st->idle) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
- bfq_put_idle_entity(st, entity);
- }
- if (busy) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
-
- if (!resume)
- elv_del_ioq_busy(e, ioq, 0);
- else
- elv_deactivate_ioq(efqd, ioq, 0);
- }
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
/*
- * Here we use a reference to bfqg. We don't need a refcounter
- * as the cgroup reference will not be dropped, so that its
- * destroy() callback will not be invoked.
+ * Mark the io group for deletion so that no new entry goes into the
+ * idle tree. Any active queue will be removed from the active
+ * tree and not put into the idle tree.
*/
- entity->parent = iog->my_entity;
- entity->sched_data = &iog->sched_data;
+ iog->deleting = 1;
- if (busy && resume)
- elv_activate_ioq(ioq, 0);
-}
-EXPORT_SYMBOL(io_ioq_move);
+ /* Flush idle tree. */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
-static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
-{
- struct elevator_queue *eq;
- struct io_entity *entity = iog->my_entity;
- struct io_service_tree *st;
- int i;
+ /*
+ * Drop io group reference on all async queues. This group is
+ * going away so once these queues are empty, free those up
+ * instead of keeping these around in the hope that new IO
+ * will come.
+ *
+ * Note: If this group is disconnected from elevator, elevator
+ * switch must have already done it.
+ */
- eq = container_of(efqd, struct elevator_queue, efqd);
- hlist_del(&iog->elv_data_node);
- __bfq_deactivate_entity(entity, 0);
- io_put_io_group_queues(eq, iog);
+ io_put_io_group_queues(iog);
- for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
- st = iog->sched_data.service_tree + i;
+ if (!io_group_has_active_entities(iog)) {
+ /*
+ * io group does not have any active entities. Because this
+ * group has been decoupled from io_cgroup list and this
+ * cgroup is being deleted, this group should not receive
+ * any new IO. Hence it should be safe to deactivate this
+ * io group and remove from the scheduling tree.
+ */
+ __bfq_deactivate_entity(iog->my_entity, 0);
/*
- * The idle tree may still contain bfq_queues belonging
- * to exited task because they never migrated to a different
- * cgroup from the one being destroyed now. Noone else
- * can access them so it's safe to act without any lock.
+ * Because this io group does not have any active entities,
+ * it should be safe to remove it from elevator list and
+ * drop elevator reference so that upon dropping io_cgroup
+ * reference, this io group should be freed and we don't
+ * wait for elevator switch to happen to free the group
+ * up.
*/
- io_flush_idle_tree(st);
+ if (queue_lock_held) {
+ hlist_del(&iog->elv_data_node);
+ rcu_assign_pointer(iog->key, NULL);
+ /*
+ * Drop iog reference taken by elevator
+ * (efqd->group_list)
+ */
+ elv_put_iog(iog);
+ }
- BUG_ON(!RB_EMPTY_ROOT(&st->active));
- BUG_ON(!RB_EMPTY_ROOT(&st->idle));
}
- BUG_ON(iog->sched_data.next_active != NULL);
- BUG_ON(iog->sched_data.active_entity != NULL);
- BUG_ON(entity->tree != NULL);
+ /* Drop iocg reference on io group */
+ elv_put_iog(iog);
}
-/**
- * bfq_destroy_group - destroy @bfqg.
- * @bgrp: the bfqio_cgroup containing @bfqg.
- * @bfqg: the group being destroyed.
- *
- * Destroy @bfqg, making sure that it is not referenced from its parent.
- */
-static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- struct elv_fq_data *efqd = NULL;
- unsigned long uninitialized_var(flags);
-
- /* Remove io group from cgroup list */
- hlist_del(&iog->group_node);
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+ unsigned long flags;
+ int queue_lock_held = 0;
+ struct elv_fq_data *efqd;
/*
* io groups are linked in two lists. One list is maintained
@@ -1677,58 +1722,93 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
* try to free up async queues again or flush the idle tree.
*/
- rcu_read_lock();
- efqd = rcu_dereference(iog->key);
- if (efqd != NULL) {
- spin_lock_irqsave(efqd->queue->queue_lock, flags);
- if (iog->key == efqd)
- __io_destroy_group(efqd, iog);
- spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
- }
- rcu_read_unlock();
-
- /*
- * No need to defer the kfree() to the end of the RCU grace
- * period: we are called from the destroy() callback of our
- * cgroup, so we can be sure that noone is a) still using
- * this cgroup or b) doing lookups in it.
- */
- kfree(iog);
-}
+retry:
+ spin_lock_irqsave(&iocg->lock, flags);
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node) {
+ /* Take the group queue lock */
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ if (spin_trylock_irq(efqd->queue->queue_lock)) {
+ if (iog->key == efqd) {
+ queue_lock_held = 1;
+ rcu_read_unlock();
+ goto locked;
+ }
-void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
- struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
- struct hlist_node *n, *tmp;
- struct io_group *iog;
+ /*
+ * After acquiring the queue lock, we found
+ * iog->key==NULL, that means elevator switch
+ * completed, group is no longer connected on
+ * elevator hence we can proceed safely without
+ * queue lock.
+ */
+ spin_unlock_irq(efqd->queue->queue_lock);
+ } else {
+ /*
+ * Did not get the queue lock while trying.
+ * Backout. Drop iocg->lock and try again
+ */
+ rcu_read_unlock();
+ spin_unlock_irqrestore(&iocg->lock, flags);
+ udelay(100);
+ goto retry;
- /*
- * Since we are destroying the cgroup, there are no more tasks
- * referencing it, and all the RCU grace periods that may have
- * referenced it are ended (as the destruction of the parent
- * cgroup is RCU-safe); bgrp->group_data will not be accessed by
- * anything else and we don't need any synchronization.
- */
- hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
- io_destroy_group(iocg, iog);
+ }
+ }
+ /*
+ * We come here when iog->key==NULL, that means elevator switch
+ * has already taken place and now this group is no longer
+ * connected to the elevator, so one does not have to hold the
+ * queue lock to do the cleanup.
+ */
+ rcu_read_unlock();
+locked:
+ __iocg_destroy(iocg, iog, queue_lock_held);
+ if (queue_lock_held) {
+ spin_unlock_irq(efqd->queue->queue_lock);
+ queue_lock_held = 0;
+ }
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);
BUG_ON(!hlist_empty(&iocg->group_data));
kfree(iocg);
}
+/* Should be called with queue lock held */
void io_disconnect_groups(struct elevator_queue *e)
{
struct hlist_node *pos, *n;
struct io_group *iog;
struct elv_fq_data *efqd = &e->efqd;
+ int i;
+ struct io_service_tree *st;
hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
elv_data_node) {
- hlist_del(&iog->elv_data_node);
-
+ /*
+ * At this point the group should be on the idle tree. This
+ * extracts the group from the idle tree.
+ */
__bfq_deactivate_entity(iog->my_entity, 0);
+ /* Flush all the idle trees of the group */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
+
+ /*
+ * This has to be here, apart from the cgroup cleanup path,
+ * because if the async queue references of the group are not
+ * dropped, the async ioq as well as the associated queue will
+ * not be reclaimed. Apart from that, the async cfqq has to be
+ * cleaned up before the elevator goes away.
+ */
+ io_put_io_group_queues(iog);
+
/*
* Don't remove from the group hash, just set an
* invalid key. No lookups can race with the
@@ -1736,11 +1816,68 @@ void io_disconnect_groups(struct elevator_queue *e)
* implies also that new elements cannot be added
* to the list.
*/
+ hlist_del(&iog->elv_data_node);
rcu_assign_pointer(iog->key, NULL);
- io_put_io_group_queues(e, iog);
+ /* Drop iog reference taken by elevator (efqd->group_list)*/
+ elv_put_iog(iog);
}
}
+/*
+ * This cleanup function does the last bit of things to destroy the cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+void io_group_cleanup(struct io_group *iog)
+{
+ struct io_service_tree *st;
+ struct io_entity *entity = iog->my_entity;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ BUG_ON(st->wsum != 0);
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity != NULL && entity->tree != NULL);
+
+ kfree(iog);
+}
+
+/*
+ * Should be called with queue lock held. The only case it can be called
+ * without queue lock held is when elevator has gone away leaving behind
+ * dead io groups which are hanging there to be reclaimed when cgroup is
+ * deleted. In case of cgroup deletion, I think there is only one thread
+ * doing deletion and rest of the threads should have been taken care by
+ * cgroup stuff.
+ */
+void elv_put_iog(struct io_group *iog)
+{
+ struct io_group *parent = NULL;
+
+ BUG_ON(!iog);
+
+ BUG_ON(atomic_read(&iog->ref) <= 0);
+ if (!atomic_dec_and_test(&iog->ref))
+ return;
+
+ BUG_ON(iog->entity.on_st);
+
+ if (iog->my_entity)
+ parent = container_of(iog->my_entity->parent,
+ struct io_group, entity);
+ io_group_cleanup(iog);
+
+ if (parent)
+ elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
struct cgroup_subsys io_subsys = {
.name = "io",
.create = iocg_create,
@@ -1887,6 +2024,8 @@ alloc_ioq:
elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
+ /* ioq reference on iog */
+ elv_get_iog(iog);
}
if (new_sched_q)
@@ -1987,7 +2126,7 @@ EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
+ io_put_io_group_queues(iog);
kfree(iog);
}
@@ -2437,13 +2576,11 @@ void elv_put_ioq(struct io_queue *ioq)
}
EXPORT_SYMBOL(elv_put_ioq);
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+void elv_release_ioq(struct io_queue **ioq_ptr)
{
- struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
- io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -2600,9 +2737,19 @@ void elv_activate_ioq(struct io_queue *ioq, int add_front)
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue)
{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
if (ioq == efqd->active_queue)
elv_reset_active_ioq(efqd);
+ /*
+ * The io group ioq belongs to is going away. Don't requeue the
+ * ioq on idle tree. Free it.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+ if (iog->deleting == 1)
+ requeue = 0;
+#endif
bfq_deactivate_entity(&ioq->entity, requeue);
}
@@ -3002,15 +3149,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
}
}
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
- struct io_queue *ioq, *n;
- struct elv_fq_data *efqd = &e->efqd;
-
- list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
- elv_deactivate_ioq(efqd, ioq, 0);
-}
-
/*
* Call iosched to let that elevator wants to expire the queue. This gives
* iosched like AS to say no (if it is in the middle of batch changeover or
@@ -3427,7 +3565,6 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
- INIT_LIST_HEAD(&efqd->idle_list);
INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
@@ -3458,9 +3595,19 @@ void elv_exit_fq_data(struct elevator_queue *e)
elv_shutdown_timer_wq(e);
spin_lock_irq(q->queue_lock);
- /* This should drop all the idle tree references of ioq */
- elv_free_idle_ioq_list(e);
- /* This should drop all the io group references of async queues */
+ /*
+ * This should drop all the references of async queues taken by
+ * io group.
+ *
+ * Also it should deactivate the group and extract it from the
+ * idle tree. (group can not be on active tree now after the
+ * elevator has been drained).
+ *
+ * Should flush the idle tree of the group, which in turn will drop
+ * ioq reference taken by active/idle tree.
+ *
+ * Drop the iog reference taken by elevator.
+ */
io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58543ec..42e3777 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,7 +165,6 @@ struct io_queue {
/* Pointer to generic elevator data structure */
struct elv_fq_data *efqd;
- struct list_head queue_list;
pid_t pid;
/* Number of requests queued on this io queue */
@@ -219,6 +218,7 @@ struct io_queue {
* o All the other fields are protected by the @bfqd queue lock.
*/
struct io_group {
+ atomic_t ref;
struct io_entity entity;
struct hlist_node elv_data_node;
struct hlist_node group_node;
@@ -242,6 +242,9 @@ struct io_group {
/* request list associated with the group */
struct request_list rl;
+
+ /* io group is going away */
+ int deleting;
};
/**
@@ -279,9 +282,6 @@ struct elv_fq_data {
/* List of io groups hanging on this elevator */
struct hlist_head group_list;
- /* List of io queues on idle tree. */
- struct list_head idle_list;
-
struct request_queue *queue;
unsigned int busy_queues;
/*
@@ -504,8 +504,6 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
#ifdef CONFIG_GROUP_IOSCHED
extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
@@ -523,6 +521,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
extern struct request_list *io_group_get_request_list(struct request_queue *q,
struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
{
@@ -545,17 +545,12 @@ static inline struct io_group *rq_iog(struct request_queue *q,
return rq->iog;
}
-#else /* !GROUP_IOSCHED */
-/*
- * No ioq movement is needed in case of flat setup. root io group gets cleaned
- * up upon elevator exit and before that it has been made sure that both
- * active and idle tree are empty.
- */
-static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline void elv_get_iog(struct io_group *iog)
{
+ atomic_inc(&iog->ref);
}
+#else /* !GROUP_IOSCHED */
static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
@@ -608,6 +603,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
return NULL;
}
+static inline void elv_get_iog(struct io_group *iog) { }
+
+static inline void elv_put_iog(struct io_group *iog) { }
extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
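[Editorial note on the iocg_destroy() rework in the patch above: it works around the lock
ordering problem between iocg->lock and the request queue lock by only trylock'ing the
queue lock, backing everything out on failure and retrying after a short delay. A
stand-alone sketch of that pattern follows; the lock and function names here are made up
for illustration, only the locking structure mirrors the patch.]

/*
 * Sketch of the trylock-with-backoff pattern used in iocg_destroy().
 * outer_lock plays the role of iocg->lock, inner_lock the role of
 * efqd->queue->queue_lock.
 */
#include <linux/spinlock.h>
#include <linux/delay.h>

static DEFINE_SPINLOCK(outer_lock);
static DEFINE_SPINLOCK(inner_lock);

static void cleanup_with_backoff(void)
{
	unsigned long flags;

retry:
	spin_lock_irqsave(&outer_lock, flags);
	if (!spin_trylock(&inner_lock)) {
		/*
		 * Can not take inner_lock after outer_lock without risking
		 * a deadlock against paths that take them in the opposite
		 * order. Back out completely, wait a bit and retry.
		 */
		spin_unlock_irqrestore(&outer_lock, flags);
		udelay(100);
		goto retry;
	}
	/* ... cleanup that needs both locks goes here ... */
	spin_unlock(&inner_lock);
	spin_unlock_irqrestore(&outer_lock, flags);
}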
* [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (31 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (4 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando
Cc: akpm, vgoyal
o Little debugging aid for hierarchical IO scheduling.
o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
o Currently it outputs more debug messages in blktrace output which help
a great deal in debugging in a hierarchical setup.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 10 +++-
block/elevator-fq.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 6 ++
3 files changed, 141 insertions(+), 6 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
request, original owner of the bio is decided by using io tracking
patches otherwise we continue to attribute the request to the
submitting thread.
-endmenu
+config DEBUG_GROUP_IOSCHED
+ bool "Debug Hierarchical Scheduling support"
+ depends on CGROUPS && GROUP_IOSCHED
+ default n
+ ---help---
+ Enable some debugging hooks for hierarchical scheduling support.
+ Currently it just outputs more information in blktrace output.
+
+endmenu
endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1dd0bb3..9500619 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -30,7 +30,7 @@ static int elv_rate_sampling_window = HZ / 10;
#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
- { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+ { RB_ROOT, RB_ROOT, 0, NULL, NULL, 0, 0 })
static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
@@ -118,6 +118,37 @@ static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
iog = container_of(entity, struct io_group, entity);
return iog;
}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+ unsigned short id = iog->iocg_id;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+
+ if (!id)
+ goto out;
+
+ css = css_lookup(&io_subsys, id);
+ if (!css)
+ goto out;
+
+ if (!css_tryget(css))
+ goto out;
+
+ cgroup_path(css->cgroup, buf, buflen);
+
+ css_put(css);
+
+ rcu_read_unlock();
+ return;
+out:
+ rcu_read_unlock();
+ buf[0] = '\0';
+ return;
+}
+#endif
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -372,7 +403,7 @@ static void bfq_active_insert(struct io_service_tree *st,
struct rb_node *node = &entity->rb_node;
bfq_insert(&st->active, entity);
-
+ st->nr_active++;
if (node->rb_left != NULL)
node = node->rb_left;
else if (node->rb_right != NULL)
@@ -434,7 +465,7 @@ static void bfq_active_extract(struct io_service_tree *st,
node = bfq_find_deepest(&entity->rb_node);
bfq_extract(&st->active, entity);
-
+ st->nr_active--;
if (node != NULL)
bfq_update_active_tree(node);
}
@@ -1233,6 +1264,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
blk_init_request_list(&iog->rl);
@@ -1506,6 +1540,9 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
/* elevator reference. */
elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
spin_unlock_irq(&iocg->lock);
return iog;
@@ -1886,6 +1923,7 @@ struct cgroup_subsys io_subsys = {
.destroy = iocg_destroy,
.populate = iocg_populate,
.subsys_id = io_subsys_id,
+ .use_id = 1,
};
/*
@@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ struct elv_fq_data *efqd = ioq->efqd;
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+ " QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d",
+ served, ioq->nr_sectors,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
/* Tells whether ioq is queued in root group or not */
@@ -2671,11 +2728,34 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
if (ioq) {
struct io_group *iog = ioq_to_io_group(ioq);
+
elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
- " weight=%ld group_weight=%ld",
+ " weight=%ld rq_queued=%d group_weight=%ld",
efqd->busy_queues,
ioq->entity.ioprio, ioq->entity.weight,
- iog_weight(iog));
+ ioq->nr_queued, iog_weight(iog));
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_service_tree *grpst;
+ int nr_active = 0;
+ if (iog != efqd->root_group) {
+ grpst = io_entity_service_tree(
+ &iog->entity);
+ nr_active = grpst->nr_active;
+ }
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "set_active, ioq grp=%s"
+ " nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d", path, nr_active,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -2764,6 +2844,22 @@ void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
efqd->busy_queues++;
if (elv_ioq_class_rt(ioq))
efqd->busy_rt_queues++;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log(efqd, "add to busy: QTt=0x%lx QTs=0x%lx "
+ "ioq grp=%s GTt=0x%lx GTs=0x%lx rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -2773,7 +2869,24 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
BUG_ON(!elv_ioq_busy(ioq));
BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+ "QTs=0x%lx ioq grp=%s GTt=0x%lx GTs=0x%lx "
+ "rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#else
elv_log_ioq(efqd, ioq, "del from busy");
+#endif
elv_clear_ioq_busy(ioq);
BUG_ON(efqd->busy_queues == 0);
efqd->busy_queues--;
@@ -3000,6 +3113,14 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
elv_ioq_update_io_thinktime(ioq);
elv_ioq_update_idle_window(q->elevator, ioq, rq);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ io_group_path(rq_iog(q, rq), path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "add rq: group path=%s "
+ "rq_queued=%d", path, ioq->nr_queued);
+ }
+#endif
if (ioq == elv_active_ioq(q->elevator)) {
/*
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 42e3777..db3a347 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,8 @@ struct io_service_tree {
struct rb_root active;
struct rb_root idle;
+ int nr_active;
+
struct io_entity *first_idle;
struct io_entity *last_idle;
@@ -245,6 +247,10 @@ struct io_group {
/* io group is going away */
int deleting;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ unsigned short iocg_id;
+#endif
};
/**
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
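[Editorial note: to actually see the extra messages this patch adds, one can run
blktrace/blkparse against the disk while the workload runs and filter for the strings
used in the elv_log_ioq()/elv_log() calls above. The exact way those messages reach the
trace stream is not shown in this hunk, so treat the following as a rough sketch;
/dev/sdb is just an example device.]

# debugfs must be mounted for blktrace to work
mount -t debugfs none /sys/kernel/debug 2>/dev/null
# dump the trace to stdout and filter for the messages added by this patch
blktrace -d /dev/sdb -o - | blkparse -i - | \
	grep -E 'set_active|add to busy|del from busy|ioq served|add rq'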
* [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (32 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-06 21:40 ` IKEDA, Munehiro
[not found] ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (3 subsequent siblings)
37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
o Little debugging aid for hierarchical IO scheduling.
o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
o Currently it outputs more debug messages in blktrace output which help
a great deal in debugging in a hierarchical setup.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 10 +++-
block/elevator-fq.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 6 ++
3 files changed, 141 insertions(+), 6 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
request, original owner of the bio is decided by using io tracking
patches otherwise we continue to attribute the request to the
submitting thread.
-endmenu
+config DEBUG_GROUP_IOSCHED
+ bool "Debug Hierarchical Scheduling support"
+ depends on CGROUPS && GROUP_IOSCHED
+ default n
+ ---help---
+ Enable some debugging hooks for hierarchical scheduling support.
+ Currently it just outputs more information in blktrace output.
+
+endmenu
endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1dd0bb3..9500619 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -30,7 +30,7 @@ static int elv_rate_sampling_window = HZ / 10;
#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
- { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+ { RB_ROOT, RB_ROOT, 0, NULL, NULL, 0, 0 })
static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
@@ -118,6 +118,37 @@ static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
iog = container_of(entity, struct io_group, entity);
return iog;
}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+ unsigned short id = iog->iocg_id;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+
+ if (!id)
+ goto out;
+
+ css = css_lookup(&io_subsys, id);
+ if (!css)
+ goto out;
+
+ if (!css_tryget(css))
+ goto out;
+
+ cgroup_path(css->cgroup, buf, buflen);
+
+ css_put(css);
+
+ rcu_read_unlock();
+ return;
+out:
+ rcu_read_unlock();
+ buf[0] = '\0';
+ return;
+}
+#endif
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -372,7 +403,7 @@ static void bfq_active_insert(struct io_service_tree *st,
struct rb_node *node = &entity->rb_node;
bfq_insert(&st->active, entity);
-
+ st->nr_active++;
if (node->rb_left != NULL)
node = node->rb_left;
else if (node->rb_right != NULL)
@@ -434,7 +465,7 @@ static void bfq_active_extract(struct io_service_tree *st,
node = bfq_find_deepest(&entity->rb_node);
bfq_extract(&st->active, entity);
-
+ st->nr_active--;
if (node != NULL)
bfq_update_active_tree(node);
}
@@ -1233,6 +1264,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
blk_init_request_list(&iog->rl);
@@ -1506,6 +1540,9 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
/* elevator reference. */
elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
spin_unlock_irq(&iocg->lock);
return iog;
@@ -1886,6 +1923,7 @@ struct cgroup_subsys io_subsys = {
.destroy = iocg_destroy,
.populate = iocg_populate,
.subsys_id = io_subsys_id,
+ .use_id = 1,
};
/*
@@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ struct elv_fq_data *efqd = ioq->efqd;
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+ " QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d",
+ served, ioq->nr_sectors,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
/* Tells whether ioq is queued in root group or not */
@@ -2671,11 +2728,34 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
if (ioq) {
struct io_group *iog = ioq_to_io_group(ioq);
+
elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
- " weight=%ld group_weight=%ld",
+ " weight=%ld rq_queued=%d group_weight=%ld",
efqd->busy_queues,
ioq->entity.ioprio, ioq->entity.weight,
- iog_weight(iog));
+ ioq->nr_queued, iog_weight(iog));
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_service_tree *grpst;
+ int nr_active = 0;
+ if (iog != efqd->root_group) {
+ grpst = io_entity_service_tree(
+ &iog->entity);
+ nr_active = grpst->nr_active;
+ }
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "set_active, ioq grp=%s"
+ " nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d", path, nr_active,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -2764,6 +2844,22 @@ void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
efqd->busy_queues++;
if (elv_ioq_class_rt(ioq))
efqd->busy_rt_queues++;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log(efqd, "add to busy: QTt=0x%lx QTs=0x%lx "
+ "ioq grp=%s GTt=0x%lx GTs=0x%lx rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -2773,7 +2869,24 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
BUG_ON(!elv_ioq_busy(ioq));
BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+ "QTs=0x%lx ioq grp=%s GTt=0x%lx GTs=0x%lx "
+ "rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#else
elv_log_ioq(efqd, ioq, "del from busy");
+#endif
elv_clear_ioq_busy(ioq);
BUG_ON(efqd->busy_queues == 0);
efqd->busy_queues--;
@@ -3000,6 +3113,14 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
elv_ioq_update_io_thinktime(ioq);
elv_ioq_update_idle_window(q->elevator, ioq, rq);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ io_group_path(rq_iog(q, rq), path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "add rq: group path=%s "
+ "rq_queued=%d", path, ioq->nr_queued);
+ }
+#endif
if (ioq == elv_active_ioq(q->elevator)) {
/*
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 42e3777..db3a347 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,8 @@ struct io_service_tree {
struct rb_root active;
struct rb_root idle;
+ int nr_active;
+
struct io_entity *first_idle;
struct io_entity *last_idle;
@@ -245,6 +247,10 @@ struct io_group {
/* io group is going away */
int deleting;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ unsigned short iocg_id;
+#endif
};
/**
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-06 21:40 ` IKEDA, Munehiro
[not found] ` <4A0203DB.1090809-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
[not found] ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-06 21:40 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, akpm
Hi Vivek,
Patching and compilation errors occurred with the 18/18 patch.
I know this is a debug patch, but I am reporting them just in case.
Vivek Goyal wrote:
> @@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
> void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
> {
> entity_served(&ioq->entity, served, ioq->nr_sectors);
Patch failed due to this line. I guess this should be
entity_served(&ioq->entity, served);
> +
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> + {
> + struct elv_fq_data *efqd = ioq->efqd;
> + char path[128];
> + struct io_group *iog = ioq_to_io_group(ioq);
> + io_group_path(iog, path, sizeof(path));
> + elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
> + " QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
> + " GTs=0x%lx rq_queued=%d",
> + served, ioq->nr_sectors,
> + ioq->entity.total_service,
> + ioq->entity.total_sector_service,
> + path,
> + iog->entity.total_service,
> + iog->entity.total_sector_service,
> + ioq->nr_queued);
> + }
> +#endif
> }
Because
io_entity::total_service
and
io_entity::total_sector_service
are not defined, compilation failed if CONFIG_DEBUG_GROUP_IOSCHED=y
here. (and everywhere referencing entity.total_service or entity.total_sector_service)
They need to be defined like:
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 1ea4ff3..6d0a735 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -147,6 +147,10 @@ struct io_entity {
unsigned short ioprio_class, new_ioprio_class;
int ioprio_changed;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ unsigned long total_service, total_sector_service;
+#endif
};
/*
Unfortunately I couldn't figure out where and how the members
should be calculated, sorry.
--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda@ds.jp.nec.com
^ permalink raw reply related [flat|nested] 297+ messages in thread
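[Editorial note: one possible answer to the open question above -- purely a guess, not
taken from the posted series. The entity_served() that patch 18/18 expects is called with
both the time served and the number of sectors, so the debug-only members could be
accumulated there for the queue entity and its parent group entities. A hypothetical
sketch, assuming the three-argument entity_served() and the for_each_entity() walk:]

/*
 * Hypothetical sketch only -- the real series may account these fields
 * elsewhere.
 */
static void entity_served(struct io_entity *entity, bfq_service_t served,
				unsigned long nr_sectors)
{
	for_each_entity(entity) {
		/* existing vtime/service accounting stays as-is */
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
		entity->total_service += served;
		entity->total_sector_service += nr_sectors;
#endif
	}
}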
[parent not found: <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
[not found] ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 21:40 ` IKEDA, Munehiro
0 siblings, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-06 21:40 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Hi Vivek,
Patching and compilation errors occurred with the 18/18 patch.
I know this is a debug patch, but I am reporting them just in case.
Vivek Goyal wrote:
> @@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
> void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
> {
> entity_served(&ioq->entity, served, ioq->nr_sectors);
Patch failed due to this line. I guess this should be
entity_served(&ioq->entity, served);
> +
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> + {
> + struct elv_fq_data *efqd = ioq->efqd;
> + char path[128];
> + struct io_group *iog = ioq_to_io_group(ioq);
> + io_group_path(iog, path, sizeof(path));
> + elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
> + " QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
> + " GTs=0x%lx rq_queued=%d",
> + served, ioq->nr_sectors,
> + ioq->entity.total_service,
> + ioq->entity.total_sector_service,
> + path,
> + iog->entity.total_service,
> + iog->entity.total_sector_service,
> + ioq->nr_queued);
> + }
> +#endif
> }
Because
io_entity::total_service
and
io_entity::total_sector_service
are not defined, compilation failed if CONFIG_DEBUG_GROUP_IOSCHED=y
here. (and everywhere referencing entity.total_service or entity.total_sector_service)
They need to be defined like:
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 1ea4ff3..6d0a735 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -147,6 +147,10 @@ struct io_entity {
unsigned short ioprio_class, new_ioprio_class;
int ioprio_changed;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ unsigned long total_service, total_sector_service;
+#endif
};
/*
Unfortunately I couldn't figure out where and how the members
should be calculated, sorry.
--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org
^ permalink raw reply related [flat|nested] 297+ messages in thread
[parent not found: <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [PATCH 01/18] io-controller: Documentation
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (20 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o Documentation for io-controller.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
Documentation/block/00-INDEX | 2 +
Documentation/block/io-controller.txt | 264 +++++++++++++++++++++++++++++++++
2 files changed, 266 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/io-controller.txt
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
+io-controller.txt
+ - IO controller for provding hierarchical IO scheduling
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..1290ada
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,264 @@
+ IO Controller
+ =============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is one
+can create cgroups and assign prio/weights to those cgroups and task group
+will get access to disk proportionate to the weight of the group.
+
+These patches modify elevator layer and individual IO schedulers to do
+IO control hence this io controller works only on block devices which use
+one of the standard io schedulers can not be used with any xyz logical block
+device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is needed only on leaf nodes where the actual contention for resources is
+present and not on intertermediate logical block devices.
+
+Consider following hypothetical scenario. Lets say there are three physical
+disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
+created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+ lv0 lv1
+ / \ / \
+ sda sdb sdc
+
+Also consider following cgroup hierarchy
+
+ root
+ / \
+ A B
+ / \ / \
+ T1 T2 T3 T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
+Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contetion for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+IO scheduler associated with the sdb will distribute disk bandwidth to
+group A and B proportionate to their weight.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. Just that it is flat and
+with cgroup stuff, it needs to be made hierarchical to achive a good
+hierarchical control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+ --------------------------------
+ | Elevator Layer + Fair Queuing |
+ --------------------------------
+ | | | |
+ NOOP DEADLINE AS CFQ
+
+Design
+======
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
+B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- Not sure if weighted round robin logic of CFQ can be easily extended for
+ hierarchical mode. One of the things is that we can not keep dividing
+ the time slice of the parent group among children. The deeper we go in the
+ hierarchy, the smaller the time slice gets.
+
+ One of the ways to implement hierarchical support could be to keep track
+ of virtual time and service provided to queue/group and select a queue/group
+ for service based on any of the various available algorithms.
+
+ BFQ already had support for hierarchical scheduling, taking those patches
+ was easier.
+
+- BFQ was designed to provide tighter bounds/delay w.r.t service provided
+ to a queue. Delay/Jitter with BFQ is O(1).
+
+ Note: BFQ originally used amount of IO done (number of sectors) as notion
+ of service provided. IOW, it tried to provide fairness in terms of
+ actual IO done and not in terms of actual time disk access was
+ given to a queue.
+
+ This patchset modified BFQ to provide fairness in the time domain because
+ that's what CFQ does. So the idea was to try not to deviate too much from
+ the CFQ behavior initially.
+
+ Providing fairness in the time domain makes accounting tricky because
+ due to command queueing, at one time there might be multiple requests
+ from different queues and there is no easy way to find out how much
+ disk time actually was consumed by the requests of a particular
+ queue. More about this in comments in source code.
+
+We have taken BFQ code as starting point for providing fairness among groups
+because it already contained lots of features which we required to implement
+hierarchical IO scheduling. With this patch set, I am not trying to ensure O(1)
+delay here as my goal is to provide fairness among groups. Most likely that
+will mean that latencies are not worse than what cfq currently provides (if
+not improved ones). Once fairness is ensured, one can look further into
+ensuring O(1) latencies.
+
+From a data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using the B-WF2Q+
+algorithm. io_queue is the end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup and cfq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue
+it is mapped to within the group depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence io group) of the
+request. Down the line we need to make use of the bio-cgroup patches to map
+delayed writes to the right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+Elevator layer continues to support the old interfaces. So even if fair queuing
+is enabled at the elevator layer, one can have both the new hierarchical
+scheduler as well as the old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have the option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in a hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+ - Enables hierarchical fair queuing in noop. Not selecting this option
+ leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+ - Enables hierarchical fair queuing in deadline. Not selecting this
+ option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+ - Enables hierarchical fair queuing in AS. Not selecting this option
+ leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+ - Enables hierarchical fair queuing in CFQ. Not selecting this option
+ still does fair queuing among various queues but it is flat and not
+ hierarchical.
+
+CGROUP_BLKIO
+ - This option enables blkio-cgroup controller for IO tracking
+ purposes. That means, by this controller one can attribute a write
+ to the original cgroup and not assume that it belongs to submitting
+ thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+ - Currently CFQ attributes the writes to the submitting thread and
+ caches the async queue pointer in the io context of the process.
+ If this option is set, it tells cfq and elevator fair queuing logic
+ that for async writes make use of IO tracking patches and attribute
+ writes to original cgroup and not to write submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+ - Throws extra debug messages in blktrace output helpful in
+ debugging in a hierarchical setup.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+ - Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+ - Enables/Disables hierarchical queuing and associated cgroup bits.
+
+TODO
+====
+- Lots of code cleanups, testing, bug fixing, optimizations,
+ benchmarking etc...
+
+- Debug and fix some of the areas where higher weight cgroup async writes
+ are stuck behind lower weight cgroup async writes.
+
+- Anticipatory code will need more work. It is not working properly currently
+ and needs more thought.
+
+- Once things start working, planning to look into core algorithm. It looks
+ complicated and maintains lots of data structures. Need to spend some time
+ to see if can be simplified.
+
+- Currently a cgroup setting is global, that is, it is applicable to all
+ the block devices in the system. Probably it will make more sense to
+ make it a per-cgroup, per-device setting so that a cgroup can have different
+ weights on different devices etc.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+ CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+ CONFIG_TRACK_ASYNC_CONTEXT=y
+
+ (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+ controller.
+
+ mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/io.ioprio
+ echo 500 > /cgroup/test2/io.ioprio
+
+- Create two files of the same size (say 512MB each) on the same disk (file1,
+ file2) and launch two dd threads in different cgroups to read those files.
+ Make sure the right io scheduler is being used for the block device where
+ the files are present (the one you compiled in hierarchical mode).
+
+ echo 1 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/lv0/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/lv0/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At a macro level, the first dd should finish first. To get more precise
+ data, keep looking (with the help of a script) at the io.disk_time and
+ io.disk_sectors files of both test1 and test2 groups. This will tell how
+ much disk time (in milliseconds) each group got and how many sectors each group
+ dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally io.disk_time of cgroups should be in proportion to the weight.
+ (It is hard to achieve though :-)).
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
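[Editorial note: the HOWTO steps in the documentation above can be condensed into a small
driver script for repeated runs. The file locations (/mnt/lv0/zerofile1, /mnt/lv0/zerofile2)
and the sampling loop are examples only; the cgroup files (io.ioprio, io.disk_time,
io.disk_sectors) are the ones described in the text.]

#!/bin/sh
# Rough driver script for the two-cgroup dd test described above.
mount -t cgroup -o io,blkio none /cgroup 2>/dev/null
mkdir -p /cgroup/test1 /cgroup/test2
echo 1000 > /cgroup/test1/io.ioprio
echo 500  > /cgroup/test2/io.ioprio

echo 1 > /proc/sys/vm/drop_caches

dd if=/mnt/lv0/zerofile1 of=/dev/null &
echo $! > /cgroup/test1/tasks
dd if=/mnt/lv0/zerofile2 of=/dev/null &
echo $! > /cgroup/test2/tasks

# Sample per-group disk time while the readers run; with a 1000:500 weight
# ratio, io.disk_time of test1 should stay roughly twice that of test2.
while pgrep -x dd > /dev/null; do
	echo "test1: $(cat /cgroup/test1/io.disk_time)  test2: $(cat /cgroup/test2/io.disk_time)"
	sleep 5
done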
* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (36 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This is common fair queuing code in the elevator layer. This is controlled by
config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
flat fair queuing support where there is only one group, the "root group", and
all the tasks belong to the root group.
These elevator layer changes are backward compatible. That means any ioscheduler
using old interfaces will continue to work.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 13 +
block/Makefile | 1 +
block/blk-sysfs.c | 25 +
block/elevator-fq.c | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 488 +++++++++++
block/elevator.c | 46 +-
include/linux/blkdev.h | 5 +
include/linux/elevator.h | 51 ++
8 files changed, 2694 insertions(+), 11 deletions(-)
create mode 100644 block/elevator-fq.c
create mode 100644 block/elevator-fq.h
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
menu "IO Schedulers"
+config ELV_FAIR_QUEUING
+ bool "Elevator Fair Queuing Support"
+ default n
+ ---help---
+ Traditionally only cfq had the notion of multiple queues and it did
+ fair queuing on its own. With cgroups and the need to control
+ IO, now even the simple io schedulers like noop, deadline and AS will
+ have one queue per cgroup and will need hierarchical fair queuing.
+ Instead of every io scheduler implementing its own fair queuing
+ logic, this option enables fair queuing in elevator layer so that
+ other ioschedulers can make use of it.
+ If unsure, say N.
+
config IOSCHED_NOOP
bool
default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING) += elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
.store = queue_iostats_store,
};
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+ .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_idle_show,
+ .store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+ .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_sync_show,
+ .store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+ .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_async_show,
+ .store = elv_slice_async_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ &queue_slice_idle_entry.attr,
+ &queue_slice_sync_entry.attr,
+ &queue_slice_async_entry.attr,
+#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ * Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE (5)
+#define ELV_HW_QUEUE_MIN (5)
+#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
+ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+ unsigned short prio)
+{
+ const int base_slice = efqd->elv_slice[sync];
+
+ WARN_ON(prio >= IOPRIO_BE_NR);
+
+ return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations. This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT 22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+ return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+ bfq_weight_t weight)
+{
+ bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+ do_div(d, weight);
+ return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+ bfq_service_t service)
+{
+ BUG_ON(entity->weight == 0);
+
+ entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data == NULL)
+ ioq = container_of(entity, struct io_queue, entity);
+ return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity. This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+ struct io_entity *entity = NULL;
+
+ if (node != NULL)
+ entity = rb_entry(node, struct io_entity, rb_node);
+
+ return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+ BUG_ON(entity->tree != root);
+
+ entity->tree = NULL;
+ rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *next;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ BUG_ON(entity->tree != &st->idle);
+
+ if (entity == st->first_idle) {
+ next = rb_next(&entity->rb_node);
+ st->first_idle = bfq_entity_of(next);
+ }
+
+ if (entity == st->last_idle) {
+ next = rb_prev(&entity->rb_node);
+ st->last_idle = bfq_entity_of(next);
+ }
+
+ bfq_extract(&st->idle, entity);
+
+ /* Delete queue from idle list */
+ if (ioq)
+ list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+ struct io_entity *entry;
+ struct rb_node **node = &root->rb_node;
+ struct rb_node *parent = NULL;
+
+ BUG_ON(entity->tree != NULL);
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb_entry(parent, struct io_entity, rb_node);
+
+ if (bfq_gt(entry->finish, entity->finish))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, root);
+
+ entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree. The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+ struct rb_node *node)
+{
+ struct io_entity *child;
+
+ if (node != NULL) {
+ child = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entity->min_start, child->min_start))
+ entity->min_start = child->min_start;
+ }
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value. The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+ struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+ entity->min_start = entity->start;
+ bfq_update_min(entity, node->rb_right);
+ bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update. This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root. The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+ struct rb_node *parent;
+
+up:
+ bfq_update_active_node(node);
+
+ parent = rb_parent(node);
+ if (parent == NULL)
+ return;
+
+ if (node == parent->rb_left && parent->rb_right != NULL)
+ bfq_update_active_node(parent->rb_right);
+ else if (parent->rb_left != NULL)
+ bfq_update_active_node(parent->rb_left);
+
+ node = parent;
+ goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node = &entity->rb_node;
+
+ bfq_insert(&st->active, entity);
+
+ if (node->rb_left != NULL)
+ node = node->rb_left;
+ else if (node->rb_right != NULL)
+ node = node->rb_right;
+
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (ioq)
+ elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch. If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+ struct rb_node *deepest;
+
+ if (node->rb_right == NULL && node->rb_left == NULL)
+ deepest = rb_parent(node);
+ else if (node->rb_right == NULL)
+ deepest = node->rb_left;
+ else if (node->rb_left == NULL)
+ deepest = node->rb_right;
+ else {
+ deepest = rb_next(node);
+ if (deepest->rb_right != NULL)
+ deepest = deepest->rb_right;
+ else if (rb_parent(deepest) != node)
+ deepest = rb_parent(deepest);
+ }
+
+ return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node;
+
+ node = bfq_find_deepest(&entity->rb_node);
+ bfq_extract(&st->active, entity);
+
+ if (node != NULL)
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+ st->first_idle = entity;
+ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+ st->last_idle = entity;
+
+ bfq_insert(&st->idle, entity);
+
+ /* Add this queue to idle list */
+ if (ioq)
+ list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue. Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(!entity->on_st);
+ entity->on_st = 0;
+ st->wsum -= entity->weight;
+ ioq = io_entity_to_ioq(entity);
+ if (!ioq)
+ return;
+ elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ bfq_idle_extract(st, entity);
+ bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+
+ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+ !bfq_gt(last_idle->finish, st->vtime)) {
+ /*
+ * Active tree is empty. Pull back vtime to finish time of
+ * last idle entity on idle tree.
+ * Rationale seems to be that it reduces the possibility of
+ * vtime wraparound (bfq_gt(V-F) < 0).
+ */
+ st->vtime = last_idle->finish;
+ }
+
+ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+ bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+ struct io_entity *entity)
+{
+ struct io_service_tree *new_st = old_st;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (entity->ioprio_changed) {
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->ioprio_changed = 0;
+
+ /*
+ * Also update the scaled budget for ioq. Group will get the
+ * updated budget once ioq is selected to run next.
+ */
+ if (ioq) {
+ struct elv_fq_data *efqd = ioq->efqd;
+ entity->budget = elv_prio_to_slice(efqd, ioq);
+ }
+
+ old_st->wsum -= entity->weight;
+ entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+ /*
+ * NOTE: here we may be changing the weight too early,
+ * this will cause unfairness. The correct approach
+ * would have required additional complexity to defer
+ * weight changes to the proper time instants (i.e.,
+ * when entity->finish <= old_st->vtime).
+ */
+ new_st = io_entity_service_tree(entity);
+ new_st->wsum += entity->weight;
+
+ if (new_st != old_st)
+ entity->start = new_st->vtime;
+ }
+
+ return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion. It uses the current budget of the entity (and the
+ * service already received if @entity is active) to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ if (entity == sd->active_entity) {
+ BUG_ON(entity->tree != NULL);
+ /*
+ * If we are requeueing the current entity we have
+ * to take care of not charging to it service it has
+ * not received.
+ */
+ bfq_calc_finish(entity, entity->service);
+ entity->start = entity->finish;
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active) {
+ /*
+ * Requeueing an entity due to a change of some
+ * next_active entity below it. We reuse the old
+ * start time.
+ */
+ bfq_active_extract(st, entity);
+ } else if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq_idle_extract() will
+ * check for that.
+ */
+ bfq_idle_extract(st, entity);
+ entity->start = bfq_gt(st->vtime, entity->finish) ?
+ st->vtime : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = st->vtime;
+ st->wsum += entity->weight;
+ bfq_get_entity(entity);
+
+ BUG_ON(entity->on_st);
+ entity->on_st = 1;
+ }
+
+ st = __bfq_entity_update_prio(st, entity);
+ /*
+ * This is to emulate cfq-like functionality where preemption can
+ * happen within the same class, e.g. a sync queue preempting an
+ * async queue. Maybe this is not a very good idea from a fairness
+ * point of view, as the preempting queue gains share. Keeping it
+ * for now.
+ */
+ if (add_front) {
+ struct io_entity *next_entity;
+
+ /*
+ * Determine the entity which will be dispatched next.
+ * Use sd->next_active once the hierarchical patch is applied.
+ */
+ next_entity = bfq_lookup_next_entity(sd, 0);
+
+ if (next_entity && next_entity != entity) {
+ struct io_service_tree *new_st;
+ bfq_timestamp_t delta;
+
+ new_st = io_entity_service_tree(next_entity);
+
+ /*
+ * At this point, both entities should belong to the
+ * same service tree, as cross service tree preemption
+ * is automatically taken care of by the algorithm.
+ */
+ BUG_ON(new_st != st);
+ entity->finish = next_entity->finish - 1;
+ delta = bfq_delta(entity->budget, entity->weight);
+ entity->start = entity->finish - delta;
+ if (bfq_gt(entity->start, st->vtime))
+ entity->start = st->vtime;
+ }
+ } else {
+ bfq_calc_finish(entity, entity->budget);
+ }
+ bfq_active_insert(st, entity);
+}
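+
+/*
+ * The timestamps set above follow the B-WF2Q+ rule documented in
+ * elevator-fq.h: a newly backlogged entity gets start = vtime of its
+ * service tree and finish = start + budget/weight (computed by
+ * bfq_calc_finish()). For example, two continuously backlogged queues
+ * with equal budgets and weights 4 and 1 see their finish times advance
+ * in a 1:4 ratio, so the weight-4 queue is selected four times as often
+ * and receives 4x the service.
+ */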
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ __bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state. If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller specified @requeue, put it on the idle tree.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+ int was_active = entity == sd->active_entity;
+ int ret = 0;
+
+ if (!entity->on_st)
+ return 0;
+
+ BUG_ON(was_active && entity->tree != NULL);
+
+ if (was_active) {
+ bfq_calc_finish(entity, entity->service);
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active)
+ bfq_active_extract(st, entity);
+ else if (entity->tree == &st->idle)
+ bfq_idle_extract(st, entity);
+ else if (entity->tree != NULL)
+ BUG();
+
+ if (!requeue || !bfq_gt(entity->finish, st->vtime))
+ bfq_forget_entity(st, entity);
+ else
+ bfq_idle_insert(st, entity);
+
+ BUG_ON(sd->active_entity == entity);
+
+ return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ __bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time. Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+ struct io_entity *entry;
+ struct rb_node *node = st->active.rb_node;
+
+ entry = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entry->min_start, st->vtime)) {
+ st->vtime = entry->min_start;
+ bfq_forget_idle(st);
+ }
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity. The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+ struct io_entity *entry, *first = NULL;
+ struct rb_node *node = st->active.rb_node;
+
+ while (node != NULL) {
+ entry = rb_entry(node, struct io_entity, rb_node);
+left:
+ if (!bfq_gt(entry->start, st->vtime))
+ first = entry;
+
+ BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+ if (node->rb_left != NULL) {
+ entry = rb_entry(node->rb_left,
+ struct io_entity, rb_node);
+ if (!bfq_gt(entry->min_start, st->vtime)) {
+ node = node->rb_left;
+ goto left;
+ }
+ }
+ if (first != NULL)
+ break;
+ node = node->rb_right;
+ }
+
+ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+ return first;
+}
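+
+/*
+ * The min_start key maintained by bfq_update_active_tree() is what keeps
+ * this lookup O(log N): the left subtree (smaller finish times) is only
+ * descended into when its min_start shows that it holds at least one
+ * eligible entity, and the right subtree is only tried while no eligible
+ * entity has been found at all.
+ */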
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+ struct io_entity *entity;
+
+ if (RB_EMPTY_ROOT(&st->active))
+ return NULL;
+
+ bfq_update_vtime(st);
+ entity = bfq_first_active_entity(st);
+ BUG_ON(bfq_gt(entity->start, st->vtime));
+
+ return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract)
+{
+ struct io_service_tree *st = sd->service_tree;
+ struct io_entity *entity;
+ int i;
+
+ /*
+ * One can check which entity will be selected next without
+ * expiring the current one.
+ */
+ BUG_ON(extract && sd->active_entity != NULL);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ entity = __bfq_lookup_next_entity(st);
+ if (entity != NULL) {
+ if (extract) {
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ }
+ break;
+ }
+ }
+
+ return entity;
+}
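+
+/*
+ * Note on the class walk above: io_entity_service_tree() indexes the
+ * service_tree[] array with ioprio_class - 1, so with the standard ioprio
+ * class values slot 0 holds IOPRIO_CLASS_RT, slot 1 IOPRIO_CLASS_BE and
+ * slot 2 IOPRIO_CLASS_IDLE. Walking the array in order therefore gives
+ * strict priority between classes: an RT entity is always picked before
+ * any BE entity, which in turn is picked before any IDLE entity.
+ */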
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+ struct io_service_tree *st;
+
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+ return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+ return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+ return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+ return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+ return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+ entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_idle_slice value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = jiffies_to_msecs(efqd->elv_slice_idle);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ data = msecs_to_jiffies(data);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice_idle = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
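+
+/*
+ * Both of the above take the value in milliseconds; the store path converts
+ * it to jiffies with msecs_to_jiffies() and the show path converts back, so
+ * e.g. writing "8" through the corresponding queue sysfs attribute sets an
+ * idle window of 8ms, while writing "0" disables idling altogether (as
+ * checked in elv_ioq_arm_slice_timer() and elv_ioq_update_idle_window()).
+ */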
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[1];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ /* 100ms is the limit for now */
+ else if (data > 100)
+ data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[1] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[0];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ /* 100ms is the limit for now */
+ else if (data > 100)
+ data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[0] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (elv_nr_busy_ioq(q->elevator)) {
+ elv_log(efqd, "schedule dispatch");
+ kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+ }
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+ struct elv_fq_data *efqd =
+ container_of(work, struct elv_fq_data, unplug_work);
+ struct request_queue *q = efqd->queue;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_start_queueing(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+ del_timer_sync(&e->efqd.idle_slice_timer);
+ cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ ioq->slice_end = jiffies + ioq->entity.budget;
+ elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ unsigned long elapsed = jiffies - ioq->last_end_request;
+ unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+ ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+ ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+ ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
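+
+/*
+ * The two running sums above implement a simple exponentially weighted
+ * average: on every completion the old values are scaled by 7/8 and the
+ * new sample is added with weight 1/8 (times 256 to keep integer
+ * precision). ttime_samples converges towards 256, so ttime_mean then
+ * approximates the recent per-request think time in jiffies, which is
+ * compared against elv_slice_idle in elv_ioq_update_idle_window().
+ */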
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+ struct io_queue *ioq, struct request *rq)
+{
+ int old_idle, enable_idle;
+ struct elv_fq_data *efqd = ioq->efqd;
+
+ /*
+ * Don't idle for async or idle io prio class
+ */
+ if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+ return;
+
+ enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+ if (!efqd->elv_slice_idle)
+ enable_idle = 0;
+ else if (ioq_sample_valid(ioq->ttime_samples)) {
+ if (ioq->ttime_mean > efqd->elv_slice_idle)
+ enable_idle = 0;
+ else
+ enable_idle = 1;
+ }
+
+ /*
+ * From a think time perspective idling should be enabled. Check with
+ * the io scheduler if it wants to disable idling based on additional
+ * considerations like seek pattern.
+ */
+ if (enable_idle) {
+ if (eq->ops->elevator_update_idle_window_fn)
+ enable_idle = eq->ops->elevator_update_idle_window_fn(
+ eq, ioq->sched_queue, rq);
+ if (!enable_idle)
+ elv_log_ioq(efqd, ioq, "iosched disabled idle");
+ }
+
+ if (old_idle != enable_idle) {
+ elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+ if (enable_idle)
+ elv_mark_ioq_idle_window(ioq);
+ else
+ elv_clear_ioq_idle_window(ioq);
+ }
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+ struct io_queue *ioq = NULL;
+
+ ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+ return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+ kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio,
+ int is_sync)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+ struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+ RB_CLEAR_NODE(&ioq->entity.rb_node);
+ atomic_set(&ioq->ref, 0);
+ ioq->efqd = efqd;
+ elv_ioq_set_ioprio_class(ioq, ioprio_class);
+ elv_ioq_set_ioprio(ioq, ioprio);
+ ioq->pid = current->pid;
+ ioq->sched_queue = sched_queue;
+ if (is_sync && !elv_ioq_class_idle(ioq))
+ elv_mark_ioq_idle_window(ioq);
+ bfq_init_entity(&ioq->entity, iog);
+ ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+ return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+ efqd);
+
+ BUG_ON(atomic_read(&ioq->ref) <= 0);
+ if (!atomic_dec_and_test(&ioq->ref))
+ return;
+ BUG_ON(ioq->nr_queued);
+ BUG_ON(ioq->entity.tree != NULL);
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(efqd->active_queue == ioq);
+
+ /* Can be called by outgoing elevator. Don't use q */
+ BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+ e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+ elv_log_ioq(efqd, ioq, "put_queue");
+ elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+ struct io_queue *ioq = *ioq_ptr;
+
+ if (ioq != NULL) {
+ /* Drop the reference taken by the io group */
+ elv_put_ioq(ioq);
+ *ioq_ptr = NULL;
+ }
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently it is done only at the root level; to begin with, the close
+ * cooperator feature is supported only for the root group, to make sure
+ * default cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ struct io_sched_data *sd = &efqd->root_group->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+ BUG_ON(!efqd->busy_queues);
+ BUG_ON(sd != entity->sched_data);
+ BUG_ON(!st);
+
+ bfq_update_vtime(st);
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ entity->service = 0;
+ elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = NULL;
+ struct io_queue *ioq = NULL;
+ struct io_sched_data *sd;
+
+ /*
+ * One can check which queue will be selected next while another
+ * queue is still active. The preempt logic uses this.
+ */
+ BUG_ON(extract && efqd->active_queue != NULL);
+
+ if (!efqd->busy_queues)
+ return NULL;
+
+ sd = &efqd->root_group->sched_data;
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ BUG_ON(!entity);
+ if (extract)
+ entity->service = 0;
+ ioq = io_entity_to_ioq(entity);
+
+ return ioq;
+}
+
+/*
+ * coop indicates that the io scheduler selected a queue for us and we did
+ * not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int coop)
+{
+ struct request_queue *q = efqd->queue;
+
+ if (ioq) {
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+ efqd->busy_queues);
+ ioq->slice_end = 0;
+
+ elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_mark_ioq_slice_new(ioq);
+
+ del_timer(&efqd->idle_slice_timer);
+ }
+
+ efqd->active_queue = ioq;
+
+ /* Let iosched know if it wants to take some action */
+ if (ioq) {
+ if (q->elevator->ops->elevator_active_ioq_set_fn)
+ q->elevator->ops->elevator_active_ioq_set_fn(q,
+ ioq->sched_queue, coop);
+ }
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ int coop = 0;
+
+ if (!ioq)
+ ioq = elv_get_next_ioq(q, 1);
+ else {
+ elv_set_next_ioq(q, ioq);
+ /*
+ * io scheduler selected the next queue for us. Pass this
+ * info back to the io scheduler. cfq currently uses it
+ * to reset the coop flag on the queue.
+ */
+ coop = 1;
+ }
+ __elv_set_active_ioq(efqd, ioq, coop);
+ return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+ struct request_queue *q = efqd->queue;
+ struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+ if (q->elevator->ops->elevator_active_ioq_reset_fn)
+ q->elevator->ops->elevator_active_ioq_reset_fn(q,
+ ioq->sched_queue);
+ efqd->active_queue = NULL;
+ del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+ bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue)
+{
+ if (ioq == efqd->active_queue)
+ elv_reset_active_ioq(efqd);
+
+ bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(ioq == efqd->active_queue);
+ elv_log_ioq(efqd, ioq, "add to busy");
+ elv_activate_ioq(ioq, 0);
+ elv_mark_ioq_busy(ioq);
+ efqd->busy_queues++;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ BUG_ON(!elv_ioq_busy(ioq));
+ BUG_ON(ioq->nr_queued);
+ elv_log_ioq(efqd, ioq, "del from busy");
+ elv_clear_ioq_busy(ioq);
+ BUG_ON(efqd->busy_queues == 0);
+ efqd->busy_queues--;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues--;
+
+ elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start/finish time of the queue and
+ * the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command queuing
+ * and requests from multiple queues can be present at the same time, it
+ * is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and further)
+ * requests to finish from the queue. For seeky queues, we will expire the
+ * queue after dispatching a few requests without waiting and start
+ * dispatching from the next queue.
+ *
+ * It is not clear how to determine the time consumed by the queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the time
+ * slice for such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+ assert_spin_locked(q->queue_lock);
+ elv_log_ioq(efqd, ioq, "slice expired");
+
+ if (elv_ioq_wait_request(ioq))
+ del_timer(&efqd->idle_slice_timer);
+
+ elv_clear_ioq_wait_request(ioq);
+
+ /*
+ * If ioq->slice_end == 0, that means the queue was expired before the
+ * first request from the queue completed. Of course we are not planning
+ * to idle on the queue, otherwise we would not have expired it.
+ *
+ * Charge 25% of the slice in such cases. This is not the best thing
+ * to do, but at the same time it is not clear what the next best
+ * thing to do is.
+ *
+ * This arises from the fact that we don't have the notion of only
+ * one queue being operational at a time. The io scheduler can dispatch
+ * requests from multiple queues in one dispatch round. Ideally, for
+ * more accurate accounting of the exact disk time used by each queue,
+ * one should dispatch requests from only one queue and wait for all
+ * the requests to finish. But this will reduce throughput.
+ */
+ if (!ioq->slice_end)
+ slice_used = entity->budget/4;
+ else {
+ if (time_after(ioq->slice_end, jiffies)) {
+ slice_unused = ioq->slice_end - jiffies;
+ if (slice_unused == entity->budget) {
+ /*
+ * queue got expired immediately after
+ * completing first request. Charge 25% of
+ * slice.
+ */
+ slice_used = entity->budget/4;
+ } else
+ slice_used = entity->budget - slice_unused;
+ } else {
+ slice_overshoot = jiffies - ioq->slice_end;
+ slice_used = entity->budget + slice_overshoot;
+ }
+ }
+
+ elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+ jiffies);
+ elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+ slice_used, entity->budget, slice_overshoot);
+ elv_ioq_served(ioq, slice_used);
+
+ BUG_ON(ioq != efqd->active_queue);
+ elv_reset_active_ioq(efqd);
+
+ if (!ioq->nr_queued)
+ elv_del_ioq_busy(q->elevator, ioq, 1);
+ else
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
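+
+/*
+ * A quick numeric illustration of the charging rules above, assuming a
+ * budget of 100 jiffies: a queue expired with 60 jiffies of its slice left
+ * is charged the 40 it actually used; a queue that overran its slice by 10
+ * jiffies is charged 110; a queue expired before (or immediately after)
+ * its first request completed is charged the flat budget/4 = 25.
+ */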
+
+/*
+ * Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (ioq)
+ __elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no, or if we aren't sure; a 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+ struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elevator_queue *eq = q->elevator;
+
+ ioq = elv_active_ioq(eq);
+
+ if (!ioq)
+ return 0;
+
+ if (elv_ioq_slice_used(ioq))
+ return 1;
+
+ if (elv_ioq_class_idle(new_ioq))
+ return 0;
+
+ if (elv_ioq_class_idle(ioq))
+ return 1;
+
+ /*
+ * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ */
+ if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+ return 1;
+
+ /*
+ * Check with io scheduler if it has additional criterion based on
+ * which it wants to preempt existing queue.
+ */
+ if (eq->ops->elevator_should_preempt_fn)
+ return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+ return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+ elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+ elv_ioq_slice_expired(q);
+
+ /*
+ * Put the new queue at the front of the current list,
+ * so we know that it will be selected next.
+ */
+
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ BUG_ON(!efqd);
+ BUG_ON(!ioq);
+ efqd->rq_queued++;
+ ioq->nr_queued++;
+
+ if (!elv_ioq_busy(ioq))
+ elv_add_ioq_busy(efqd, ioq);
+
+ elv_ioq_update_io_thinktime(ioq);
+ elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+ if (ioq == elv_active_ioq(q->elevator)) {
+ /*
+ * Remember that we saw a request from this process, but
+ * don't start queuing just yet. Otherwise we risk seeing lots
+ * of tiny requests, because we disrupt the normal plugging
+ * and merging. If the request is already larger than a single
+ * page, let it rip immediately. For that case we assume that
+ * merging is already done. Ditto for a busy system that
+ * has other work pending, don't risk delaying until the
+ * idle timer unplug to continue working.
+ */
+ if (elv_ioq_wait_request(ioq)) {
+ if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+ efqd->busy_queues > 1) {
+ del_timer(&efqd->idle_slice_timer);
+ blk_start_queueing(q);
+ }
+ elv_mark_ioq_must_dispatch(ioq);
+ }
+ } else if (elv_should_preempt(q, ioq, rq)) {
+ /*
+ * Not the active queue - expire the current slice if it is
+ * idle and has expired its mean thinktime, or this new queue
+ * has some old slice time left and is of higher priority, or
+ * this new queue is RT and the current one is BE.
+ */
+ elv_preempt_queue(q, ioq);
+ blk_start_queueing(q);
+ }
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+ struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+ struct io_queue *ioq;
+ unsigned long flags;
+ struct request_queue *q = efqd->queue;
+
+ elv_log(efqd, "idle timer fired");
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ ioq = efqd->active_queue;
+
+ if (ioq) {
+
+ /*
+ * We saw a request before the queue expired, let it through
+ */
+ if (elv_ioq_must_dispatch(ioq))
+ goto out_kick;
+
+ /*
+ * expired
+ */
+ if (elv_ioq_slice_used(ioq))
+ goto expire;
+
+ /*
+ * only expire and reinvoke request handler, if there are
+ * other queues with pending requests
+ */
+ if (!elv_nr_busy_ioq(q->elevator))
+ goto out_cont;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (ioq->nr_queued)
+ goto out_kick;
+ }
+expire:
+ elv_ioq_slice_expired(q);
+out_kick:
+ elv_schedule_dispatch(q);
+out_cont:
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+ unsigned long sl;
+
+ BUG_ON(!ioq);
+
+ /*
+ * SSD device without seek penalty, disable idling. But only do so
+ * for devices that support queuing, otherwise we still have a problem
+ * with sync vs async workloads.
+ */
+ if (blk_queue_nonrot(q) && efqd->hw_tag)
+ return;
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver)
+ return;
+
+ /*
+ * idle is disabled, either manually or by past process history
+ */
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ return;
+
+ /*
+ * Maybe the iosched has got its own idling logic. In that case the io
+ * scheduler will take care of arming the timer, if need be.
+ */
+ if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ q->elevator->ops->elevator_arm_slice_timer_fn(q,
+ ioq->sched_queue);
+ } else {
+ elv_mark_ioq_wait_request(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu", sl);
+ }
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+ struct io_queue *ioq, *n;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+ elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+ if (!elv_nr_busy_ioq(q->elevator))
+ return NULL;
+
+ if (ioq == NULL)
+ goto new_queue;
+
+ /*
+ * Force dispatch. Continue to dispatch from current queue as long
+ * as it has requests.
+ */
+ if (unlikely(force)) {
+ if (ioq->nr_queued)
+ goto keep_queue;
+ else
+ goto expire;
+ }
+
+ /*
+ * The active queue has run out of time, expire it and select new.
+ */
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+ goto expire;
+
+ /*
+ * If we have an RT queue waiting, then we pre-empt the current non-RT
+ * queue.
+ */
+ if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+ /*
+ * We simulate this as the queue having timed out, so that it gets to
+ * bank the remainder of its time slice.
+ */
+ elv_log_ioq(efqd, ioq, "preempt");
+ goto expire;
+ }
+
+ /*
+ * The active queue has requests and isn't expired, allow it to
+ * dispatch.
+ */
+
+ if (ioq->nr_queued)
+ goto keep_queue;
+
+ /*
+ * If another queue has a request waiting within our mean seek
+ * distance, let it run. The expire code will check for close
+ * cooperators and put the close queue at the front of the service
+ * tree.
+ */
+ new_ioq = elv_close_cooperator(q, ioq, 0);
+ if (new_ioq)
+ goto expire;
+
+ /*
+ * No requests pending. If the active queue still has requests in
+ * flight or is idling for a new request, allow either of these
+ * conditions to happen (or time out) before selecting a new queue.
+ */
+
+ if (timer_pending(&efqd->idle_slice_timer) ||
+ (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
+expire:
+ elv_ioq_slice_expired(q);
+new_queue:
+ ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+ return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elv_fq_data *efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ ioq = rq->ioq;
+ BUG_ON(!ioq);
+ ioq->nr_queued--;
+
+ efqd = ioq->efqd;
+ BUG_ON(!efqd);
+ efqd->rq_queued--;
+
+ if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+ elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ BUG_ON(!ioq);
+ elv_ioq_request_dispatched(ioq);
+ elv_ioq_request_removed(e, rq);
+ elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ efqd->rq_in_driver++;
+ elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ WARN_ON(!efqd->rq_in_driver);
+ efqd->rq_in_driver--;
+ elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+ if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+ efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+ if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+ efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+ return;
+
+ if (efqd->hw_tag_samples++ < 50)
+ return;
+
+ if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+ efqd->hw_tag = 1;
+ else
+ efqd->hw_tag = 0;
+
+ efqd->hw_tag_samples = 0;
+ efqd->rq_in_driver_peak = 0;
+}
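+
+/*
+ * For example, assuming ELV_HW_QUEUE_MIN is 5 (the value CFQ uses for the
+ * equivalent CFQ_HW_QUEUE_MIN), hw_tag ends up set only if, over a window
+ * of 50 qualifying samples, the driver was observed holding 5 or more
+ * requests at once; otherwise the device is treated as non-queuing and the
+ * idling decision in elv_ioq_arm_slice_timer() behaves accordingly.
+ */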
+
+/*
+ * If the io scheduler has the functionality of keeping track of close
+ * cooperators, check with it whether it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *new_ioq = NULL;
+
+ /*
+ * Currently this feature is supported only for flat hierarchy or
+ * root group queues so that default cfq behavior is not changed.
+ */
+ if (!is_root_group_ioq(q, ioq))
+ return NULL;
+
+ if (q->elevator->ops->elevator_close_cooperator_fn)
+ new_ioq = e->ops->elevator_close_cooperator_fn(q,
+ ioq->sched_queue, probe);
+
+ /* Only select co-operating queue if it belongs to root group */
+ if (new_ioq && !is_root_group_ioq(q, new_ioq))
+ return NULL;
+
+ return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+ const int sync = rq_is_sync(rq);
+ struct io_queue *ioq = rq->ioq;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ elv_log_ioq(efqd, ioq, "complete");
+
+ elv_update_hw_tag(efqd);
+
+ WARN_ON(!efqd->rq_in_driver);
+ WARN_ON(!ioq->dispatched);
+ efqd->rq_in_driver--;
+ ioq->dispatched--;
+
+ if (sync)
+ ioq->last_end_request = jiffies;
+
+ /*
+ * If this is the active queue, check if it needs to be expired,
+ * or if we want to idle in case it has no pending requests.
+ */
+
+ if (elv_active_ioq(q->elevator) == ioq) {
+ if (elv_ioq_slice_new(ioq)) {
+ elv_ioq_set_prio_slice(q, ioq);
+ elv_clear_ioq_slice_new(ioq);
+ }
+ /*
+ * If there are no requests waiting in this queue, and
+ * there are other queues ready to issue requests, AND
+ * those other queues are issuing requests within our
+ * mean seek distance, give them a chance to run instead
+ * of idling.
+ */
+ if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ elv_ioq_slice_expired(q);
+ else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ && sync && !rq_noidle(rq))
+ elv_ioq_arm_slice_timer(q);
+ }
+
+ if (!efqd->rq_in_driver)
+ elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio)
+{
+ struct io_queue *ioq = NULL;
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ ioq = iog->async_queue[0][ioprio];
+ break;
+ case IOPRIO_CLASS_BE:
+ ioq = iog->async_queue[1][ioprio];
+ break;
+ case IOPRIO_CLASS_IDLE:
+ ioq = iog->async_idle_queue;
+ break;
+ default:
+ BUG();
+ }
+
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq)
+{
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ iog->async_queue[0][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_BE:
+ iog->async_queue[1][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ iog->async_idle_queue = ioq;
+ break;
+ default:
+ BUG();
+ }
+
+ /*
+ * Take the group reference and pin the queue. Group exit will
+ * clean it up
+ */
+ elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
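+
+/*
+ * With IOPRIO_BE_NR == 8 this drops the 2 * 8 per-priority async queue
+ * references (RT and BE classes) plus the single async idle queue
+ * reference taken via io_group_set_async_queue(), i.e. up to 17 queue
+ * references per group.
+ */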
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+ /*
+ * Caller already ensured that pending RCU callbacks are completed,
+ * so we should have no busy allocations at this point.
+ */
+ if (elv_ioq_pool)
+ kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+ elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+ if (!elv_ioq_pool)
+ goto fail;
+
+ return 0;
+fail:
+ elv_slab_kill();
+ return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ iog = io_alloc_root_group(q, e, efqd);
+ if (iog == NULL)
+ return 1;
+
+ efqd->root_group = iog;
+ efqd->queue = q;
+
+ init_timer(&efqd->idle_slice_timer);
+ efqd->idle_slice_timer.function = elv_idle_slice_timer;
+ efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+ INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+ INIT_LIST_HEAD(&efqd->idle_list);
+
+ efqd->elv_slice[0] = elv_slice_async;
+ efqd->elv_slice[1] = elv_slice_sync;
+ efqd->elv_slice_idle = elv_slice_idle;
+ efqd->hw_tag = 1;
+
+ return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later,
+ * during elevator cleanup, the ioc reference will be dropped, which will
+ * lead to removal of the ioscheduler queue as well as the associated ioq
+ * object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+ struct request_queue *q = efqd->queue;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+
+ spin_lock_irq(q->queue_lock);
+ /* This should drop all the idle tree references of ioq */
+ elv_free_idle_ioq_list(e);
+ spin_unlock_irq(q->queue_lock);
+
+ elv_shutdown_timer_wq(e);
+
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+ io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. To me the io scheduler has already been drained out, and all
+ * the active queues have already been expired, so the timer and work queue
+ * should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+ if (elv_slab_setup())
+ return -ENOMEM;
+
+ /* could be 0 on HZ < 1000 setups */
+
+ if (!elv_slice_async)
+ elv_slice_async = 1;
+
+ if (!elv_slice_idle)
+ elv_slice_idle = 1;
+
+ return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ * Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES 3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree. All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+ struct rb_root active;
+ struct rb_root idle;
+
+ struct io_entity *first_idle;
+ struct io_entity *last_idle;
+
+ bfq_timestamp_t vtime;
+ bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue. It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_sched_data {
+ struct io_entity *active_entity;
+ struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ * the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ * this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ * associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ * ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ * ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy. Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace for now. Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag. As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ. When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_entity {
+ struct rb_node rb_node;
+
+ int on_st;
+
+ bfq_timestamp_t finish;
+ bfq_timestamp_t start;
+
+ struct rb_root *tree;
+
+ bfq_timestamp_t min_start;
+
+ bfq_service_t service, budget;
+ bfq_weight_t weight;
+
+ struct io_entity *parent;
+
+ struct io_sched_data *my_sched_data;
+ struct io_sched_data *sched_data;
+
+ unsigned short ioprio, new_ioprio;
+ unsigned short ioprio_class, new_ioprio_class;
+
+ int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+ struct io_entity entity;
+ atomic_t ref;
+ unsigned int flags;
+
+ /* Pointer to generic elevator data structure */
+ struct elv_fq_data *efqd;
+ struct list_head queue_list;
+ pid_t pid;
+
+ /* Number of requests queued on this io queue */
+ unsigned long nr_queued;
+
+ /* Requests dispatched from this queue */
+ int dispatched;
+
+ /* Keep track of the think time of processes in this queue */
+ unsigned long last_end_request;
+ unsigned long ttime_total;
+ unsigned long ttime_samples;
+ unsigned long ttime_mean;
+
+ unsigned long slice_end;
+
+ /* Pointer to io scheduler's queue */
+ void *sched_queue;
+};
+
+struct io_group {
+ struct io_sched_data sched_data;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+ struct io_group *root_group;
+
+ /* List of io queues on idle tree. */
+ struct list_head idle_list;
+
+ struct request_queue *queue;
+ unsigned int busy_queues;
+ /*
+ * Used to track any pending rt requests so we can pre-empt the current
+ * non-RT queue in service when this value is non-zero.
+ */
+ unsigned int busy_rt_queues;
+
+ /* Number of requests queued */
+ int rq_queued;
+
+ /* Pointer to the ioscheduler queue being served */
+ void *active_queue;
+
+ int rq_in_driver;
+ int hw_tag;
+ int hw_tag_samples;
+ int rq_in_driver_peak;
+
+ /*
+ * The elevator fair queuing layer has the capability to provide idling
+ * to ensure fairness for processes doing dependent reads.
+ * This might be needed to ensure fairness between two processes doing
+ * synchronous reads in two different cgroups. noop and deadline don't
+ * have any notion of anticipation/idling of their own; as of now, they
+ * are the users of this functionality.
+ */
+ unsigned int elv_slice_idle;
+ struct timer_list idle_slice_timer;
+ struct work_struct unplug_work;
+
+ unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid, \
+ elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples) ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+ ELV_QUEUE_FLAG_busy = 0, /* has requests or is under service */
+ ELV_QUEUE_FLAG_sync, /* synchronous queue */
+ ELV_QUEUE_FLAG_idle_window, /* elevator slice idling enabled */
+ ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
+ ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
+ ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name) \
+static inline void elv_mark_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline int elv_ioq_##name(struct io_queue *ioq) \
+{ \
+ return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0; \
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
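+
+/*
+ * Each invocation above expands to three helpers; e.g.
+ * ELV_IO_QUEUE_FLAG_FNS(busy) generates elv_mark_ioq_busy(),
+ * elv_clear_ioq_busy() and elv_ioq_busy(), which set, clear and test the
+ * ELV_QUEUE_FLAG_busy bit in ioq->flags respectively.
+ */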
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+ struct io_sched_data *sched_data = entity->sched_data;
+ unsigned int idx = entity->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+ BUG_ON(sched_data == NULL);
+
+ return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+ ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+ if (elv_ioq_slice_new(ioq))
+ return 0;
+ if (time_before(jiffies, ioq->slice_end))
+ return 0;
+
+ return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+ return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+ return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+ return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+ return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+ return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+ atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+ unsigned long slice_end)
+{
+ ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+ int ioprio_class)
+{
+ ioq->entity.new_ioprio_class = ioprio_class;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+ ioq->entity.new_ioprio = ioprio;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+ return container_of(ioq->entity.sched_data, struct io_group,
+ sched_data);
+}
+
+/* Functions used by blk-sysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+ struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+ unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+ struct elevator_queue *e)
+{
+ return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
for (i = 0; i < ELV_HASH_ENTRIES; i++)
INIT_HLIST_HEAD(&eq->hash[i]);
+ if (elv_init_fq_data(q, eq))
+ goto err;
+
return eq;
err:
kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
e->ops = NULL;
+ elv_exit_fq_data_post(e);
mutex_unlock(&e->sysfs_lock);
kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_activate_rq(q, rq);
+
if (e->ops->elevator_activate_req_fn)
e->ops->elevator_activate_req_fn(q, rq);
}
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_deactivate_rq(q, rq);
+
if (e->ops->elevator_deactivate_req_fn)
e->ops->elevator_deactivate_req_fn(q, rq);
}
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
boundary = q->end_sector;
stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
q->end_sector = rq_end_sector(rq);
q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
elv_rqhash_del(q, next);
q->nr_sorted--;
+ elv_ioq_request_removed(e, next);
q->last_merge = rq;
}
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->last_merge = rq;
}
- /*
- * Some ioscheds (cfq) run q->request_fn directly, so
- * rq cannot be accessed after calling
- * elevator_add_req_fn.
- */
q->elevator->ops->elevator_add_req_fn(q, rq);
+ elv_ioq_request_add(q, rq);
break;
case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
int elv_queue_empty(struct request_queue *q)
{
- struct elevator_queue *e = q->elevator;
-
if (!list_empty(&q->queue_head))
return 0;
- if (e->ops->elevator_queue_empty_fn)
- return e->ops->elevator_queue_empty_fn(q);
+ /* Hopefully nr_sorted works and no need to call queue_empty_fn */
+ if (q->nr_sorted)
+ return 0;
return 1;
}
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
*/
if (blk_account_rq(rq)) {
q->in_flight--;
- if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
- e->ops->elevator_completed_req_fn(q, rq);
+ if (blk_sorted_rq(rq)) {
+ if (e->ops->elevator_completed_req_fn)
+ e->ops->elevator_completed_req_fn(q, rq);
+ elv_ioq_completed_request(q, rq);
+ }
}
/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
return NULL;
}
EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+ return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+ return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
/* for bidi */
struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* io queue request belongs to */
+ struct io_queue *ioq;
+#endif
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
#define _LINUX_ELEVATOR_H
#include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
#ifdef CONFIG_BLOCK
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+ struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+ struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+ void*, int probe);
+#endif
struct elevator_ops
{
@@ -56,6 +69,17 @@ struct elevator_ops
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+ elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+ elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+ elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+ elevator_should_preempt_fn *elevator_should_preempt_fn;
+ elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+ elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
};
#define ELV_NAME_MAX (16)
@@ -76,6 +100,9 @@ struct elevator_type
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ int elevator_features;
+#endif
};
/*
@@ -89,6 +116,10 @@ struct elevator_queue
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* fair queuing data */
+ struct elv_fq_data efqd;
+#endif
};
/*
@@ -209,5 +240,25 @@ enum {
__val; \
})
+/* iosched can let elevator know its feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define ELV_IOSCHED_NEED_FQ 1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
@ 2009-05-05 19:58 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
This is common fair queuing code in the elevator layer. It is controlled by
the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group", and
all tasks belong to it.
These elevator layer changes are backward compatible. That means any ioscheduler
using the old interfaces will continue to work.
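As an illustration (not part of this patch), a hypothetical io scheduler
could opt into the elevator fair queuing layer roughly as sketched below.
The names foo_dispatch and iosched_foo are made up for the example; only
ELV_IOSCHED_NEED_FQ, elv_select_sched_queue() and the elevator_features
field come from this series.

	/* Sketch only: a single-queue iosched using the elevator fq layer */
	static int foo_dispatch(struct request_queue *q, int force)
	{
		/* Ask the fair queuing layer which io queue should run now */
		void *fooq = elv_select_sched_queue(q, force);

		if (!fooq)
			return 0;

		/* ... move one request from fooq to the dispatch list ... */
		return 1;
	}

	static struct elevator_type iosched_foo = {
		.ops = {
			.elevator_dispatch_fn	= foo_dispatch,
			/* ... */
		},
		.elevator_name		= "foo",
		.elevator_owner		= THIS_MODULE,
		/* Tell the elevator layer this iosched wants its fq logic */
		.elevator_features	= ELV_IOSCHED_NEED_FQ,
	};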
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/Kconfig.iosched | 13 +
block/Makefile | 1 +
block/blk-sysfs.c | 25 +
block/elevator-fq.c | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 488 +++++++++++
block/elevator.c | 46 +-
include/linux/blkdev.h | 5 +
include/linux/elevator.h | 51 ++
8 files changed, 2694 insertions(+), 11 deletions(-)
create mode 100644 block/elevator-fq.c
create mode 100644 block/elevator-fq.h
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
menu "IO Schedulers"
+config ELV_FAIR_QUEUING
+ bool "Elevator Fair Queuing Support"
+ default n
+ ---help---
+ Traditionally only cfq had the notion of multiple queues and it did
+ fair queuing on its own. With cgroups and the need to control
+ IO, now even the simple io schedulers like noop, deadline and as will
+ have one queue per cgroup and will need hierarchical fair queuing.
+ Instead of every io scheduler implementing its own fair queuing
+ logic, this option enables fair queuing in the elevator layer so that
+ other ioschedulers can make use of it.
+ If unsure, say N.
+
config IOSCHED_NOOP
bool
default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING) += elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
.store = queue_iostats_store,
};
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+ .attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_idle_show,
+ .store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+ .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_sync_show,
+ .store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+ .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_slice_async_show,
+ .store = elv_slice_async_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ &queue_slice_idle_entry.attr,
+ &queue_slice_sync_entry.attr,
+ &queue_slice_async_entry.attr,
+#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE (5)
+#define ELV_HW_QUEUE_MIN (5)
+#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
+ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+ unsigned short prio)
+{
+ const int base_slice = efqd->elv_slice[sync];
+
+ WARN_ON(prio >= IOPRIO_BE_NR);
+
+ return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations. This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT 22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+ return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+ bfq_weight_t weight)
+{
+ bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+ do_div(d, weight);
+ return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+ bfq_service_t service)
+{
+ BUG_ON(entity->weight == 0);
+
+ entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data == NULL)
+ ioq = container_of(entity, struct io_queue, entity);
+ return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity. This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+ struct io_entity *entity = NULL;
+
+ if (node != NULL)
+ entity = rb_entry(node, struct io_entity, rb_node);
+
+ return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+ BUG_ON(entity->tree != root);
+
+ entity->tree = NULL;
+ rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *next;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ BUG_ON(entity->tree != &st->idle);
+
+ if (entity == st->first_idle) {
+ next = rb_next(&entity->rb_node);
+ st->first_idle = bfq_entity_of(next);
+ }
+
+ if (entity == st->last_idle) {
+ next = rb_prev(&entity->rb_node);
+ st->last_idle = bfq_entity_of(next);
+ }
+
+ bfq_extract(&st->idle, entity);
+
+ /* Delete queue from idle list */
+ if (ioq)
+ list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+ struct io_entity *entry;
+ struct rb_node **node = &root->rb_node;
+ struct rb_node *parent = NULL;
+
+ BUG_ON(entity->tree != NULL);
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb_entry(parent, struct io_entity, rb_node);
+
+ if (bfq_gt(entry->finish, entity->finish))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, root);
+
+ entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree. The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+ struct rb_node *node)
+{
+ struct io_entity *child;
+
+ if (node != NULL) {
+ child = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entity->min_start, child->min_start))
+ entity->min_start = child->min_start;
+ }
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value. The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+ struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+ entity->min_start = entity->start;
+ bfq_update_min(entity, node->rb_right);
+ bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update. This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root. The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+ struct rb_node *parent;
+
+up:
+ bfq_update_active_node(node);
+
+ parent = rb_parent(node);
+ if (parent == NULL)
+ return;
+
+ if (node == parent->rb_left && parent->rb_right != NULL)
+ bfq_update_active_node(parent->rb_right);
+ else if (parent->rb_left != NULL)
+ bfq_update_active_node(parent->rb_left);
+
+ node = parent;
+ goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node = &entity->rb_node;
+
+ bfq_insert(&st->active, entity);
+
+ if (node->rb_left != NULL)
+ node = node->rb_left;
+ else if (node->rb_right != NULL)
+ node = node->rb_right;
+
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (ioq)
+ elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch. If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+ struct rb_node *deepest;
+
+ if (node->rb_right == NULL && node->rb_left == NULL)
+ deepest = rb_parent(node);
+ else if (node->rb_right == NULL)
+ deepest = node->rb_left;
+ else if (node->rb_left == NULL)
+ deepest = node->rb_right;
+ else {
+ deepest = rb_next(node);
+ if (deepest->rb_right != NULL)
+ deepest = deepest->rb_right;
+ else if (rb_parent(deepest) != node)
+ deepest = rb_parent(deepest);
+ }
+
+ return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct rb_node *node;
+
+ node = bfq_find_deepest(&entity->rb_node);
+ bfq_extract(&st->active, entity);
+
+ if (node != NULL)
+ bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+ st->first_idle = entity;
+ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+ st->last_idle = entity;
+
+ bfq_insert(&st->idle, entity);
+
+ /* Add this queue to idle list */
+ if (ioq)
+ list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue. Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ struct io_queue *ioq = NULL;
+
+ BUG_ON(!entity->on_st);
+ entity->on_st = 0;
+ st->wsum -= entity->weight;
+ ioq = io_entity_to_ioq(entity);
+ if (!ioq)
+ return;
+ elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+ struct io_entity *entity)
+{
+ bfq_idle_extract(st, entity);
+ bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+ struct io_entity *first_idle = st->first_idle;
+ struct io_entity *last_idle = st->last_idle;
+
+ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+ !bfq_gt(last_idle->finish, st->vtime)) {
+ /*
+ * Active tree is empty. Pull back vtime to finish time of
+ * last idle entity on idle tree.
+ * Rationale seems to be that it reduces the possibility of
+ * vtime wraparound (bfq_gt(V-F) < 0).
+ */
+ st->vtime = last_idle->finish;
+ }
+
+ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+ bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+ struct io_entity *entity)
+{
+ struct io_service_tree *new_st = old_st;
+ struct io_queue *ioq = io_entity_to_ioq(entity);
+
+ if (entity->ioprio_changed) {
+ entity->ioprio = entity->new_ioprio;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->ioprio_changed = 0;
+
+ /*
+ * Also update the scaled budget for ioq. Group will get the
+ * updated budget once ioq is selected to run next.
+ */
+ if (ioq) {
+ struct elv_fq_data *efqd = ioq->efqd;
+ entity->budget = elv_prio_to_slice(efqd, ioq);
+ }
+
+ old_st->wsum -= entity->weight;
+ entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+ /*
+ * NOTE: here we may be changing the weight too early,
+ * this will cause unfairness. The correct approach
+ * would have required additional complexity to defer
+ * weight changes to the proper time instants (i.e.,
+ * when entity->finish <= old_st->vtime).
+ */
+ new_st = io_entity_service_tree(entity);
+ new_st->wsum += entity->weight;
+
+ if (new_st != old_st)
+ entity->start = new_st->vtime;
+ }
+
+ return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion. It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ if (entity == sd->active_entity) {
+ BUG_ON(entity->tree != NULL);
+ /*
+ * If we are requeueing the current entity we have
+ * to take care of not charging to it service it has
+ * not received.
+ */
+ bfq_calc_finish(entity, entity->service);
+ entity->start = entity->finish;
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active) {
+ /*
+ * Requeueing an entity due to a change of some
+ * next_active entity below it. We reuse the old
+ * start time.
+ */
+ bfq_active_extract(st, entity);
+ } else if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq_idle_extract() will
+ * check for that.
+ */
+ bfq_idle_extract(st, entity);
+ entity->start = bfq_gt(st->vtime, entity->finish) ?
+ st->vtime : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = st->vtime;
+ st->wsum += entity->weight;
+ bfq_get_entity(entity);
+
+ BUG_ON(entity->on_st);
+ entity->on_st = 1;
+ }
+
+ st = __bfq_entity_update_prio(st, entity);
+ /*
+ * This is to emulate cfq-like functionality where preemption can
+ * happen within the same class, like a sync queue preempting an async
+ * queue. Maybe this is not a very good idea from a fairness point of
+ * view as the preempting queue gains share. Keeping it for now.
+ */
+ if (add_front) {
+ struct io_entity *next_entity;
+
+ /*
+ * Determine the entity which will be dispatched next
+ * Use sd->next_active once hierarchical patch is applied
+ */
+ next_entity = bfq_lookup_next_entity(sd, 0);
+
+ if (next_entity && next_entity != entity) {
+ struct io_service_tree *new_st;
+ bfq_timestamp_t delta;
+
+ new_st = io_entity_service_tree(next_entity);
+
+ /*
+ * At this point, both entities should belong to the
+ * same service tree, as cross service tree preemption
+ * is automatically taken care of by the algorithm
+ */
+ BUG_ON(new_st != st);
+ entity->finish = next_entity->finish - 1;
+ delta = bfq_delta(entity->budget, entity->weight);
+ entity->start = entity->finish - delta;
+ if (bfq_gt(entity->start, st->vtime))
+ entity->start = st->vtime;
+ }
+ } else {
+ bfq_calc_finish(entity, entity->budget);
+ }
+ bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+ __bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state. If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ struct io_sched_data *sd = entity->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+ int was_active = entity == sd->active_entity;
+ int ret = 0;
+
+ if (!entity->on_st)
+ return 0;
+
+ BUG_ON(was_active && entity->tree != NULL);
+
+ if (was_active) {
+ bfq_calc_finish(entity, entity->service);
+ sd->active_entity = NULL;
+ } else if (entity->tree == &st->active)
+ bfq_active_extract(st, entity);
+ else if (entity->tree == &st->idle)
+ bfq_idle_extract(st, entity);
+ else if (entity->tree != NULL)
+ BUG();
+
+ if (!requeue || !bfq_gt(entity->finish, st->vtime))
+ bfq_forget_entity(st, entity);
+ else
+ bfq_idle_insert(st, entity);
+
+ BUG_ON(sd->active_entity == entity);
+
+ return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+ __bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time. Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+ struct io_entity *entry;
+ struct rb_node *node = st->active.rb_node;
+
+ entry = rb_entry(node, struct io_entity, rb_node);
+ if (bfq_gt(entry->min_start, st->vtime)) {
+ st->vtime = entry->min_start;
+ bfq_forget_idle(st);
+ }
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity. The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+ struct io_entity *entry, *first = NULL;
+ struct rb_node *node = st->active.rb_node;
+
+ while (node != NULL) {
+ entry = rb_entry(node, struct io_entity, rb_node);
+left:
+ if (!bfq_gt(entry->start, st->vtime))
+ first = entry;
+
+ BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+ if (node->rb_left != NULL) {
+ entry = rb_entry(node->rb_left,
+ struct io_entity, rb_node);
+ if (!bfq_gt(entry->min_start, st->vtime)) {
+ node = node->rb_left;
+ goto left;
+ }
+ }
+ if (first != NULL)
+ break;
+ node = node->rb_right;
+ }
+
+ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+ return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+ struct io_entity *entity;
+
+ if (RB_EMPTY_ROOT(&st->active))
+ return NULL;
+
+ bfq_update_vtime(st);
+ entity = bfq_first_active_entity(st);
+ BUG_ON(bfq_gt(entity->start, st->vtime));
+
+ return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract)
+{
+ struct io_service_tree *st = sd->service_tree;
+ struct io_entity *entity;
+ int i;
+
+ /*
+ * One can check which entity will be selected next without
+ * expiring the current one.
+ */
+ BUG_ON(extract && sd->active_entity != NULL);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ entity = __bfq_lookup_next_entity(st);
+ if (entity != NULL) {
+ if (extract) {
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ }
+ break;
+ }
+ }
+
+ return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+ struct io_service_tree *st;
+
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+ return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+ return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+ return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+ return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+ return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+
+ return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+ return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+ entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_slice_idle value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = jiffies_to_msecs(efqd->elv_slice_idle);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ data = msecs_to_jiffies(data);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice_idle = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[1];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ /* 100ms is the limit for now */
+ else if (data > 100)
+ data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[1] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->elv_slice[0];
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ /* 100ms is the limit for now */
+ else if (data > 100)
+ data = 100;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->elv_slice[0] = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (elv_nr_busy_ioq(q->elevator)) {
+ elv_log(efqd, "schedule dispatch");
+ kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+ }
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+ struct elv_fq_data *efqd =
+ container_of(work, struct elv_fq_data, unplug_work);
+ struct request_queue *q = efqd->queue;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_start_queueing(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+ del_timer_sync(&e->efqd.idle_slice_timer);
+ cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ ioq->slice_end = jiffies + ioq->entity.budget;
+ elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ unsigned long elapsed = jiffies - ioq->last_end_request;
+ unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+ ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+ ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+ ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+ struct io_queue *ioq, struct request *rq)
+{
+ int old_idle, enable_idle;
+ struct elv_fq_data *efqd = ioq->efqd;
+
+ /*
+ * Don't idle for async or idle io prio class
+ */
+ if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+ return;
+
+ enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+ if (!efqd->elv_slice_idle)
+ enable_idle = 0;
+ else if (ioq_sample_valid(ioq->ttime_samples)) {
+ if (ioq->ttime_mean > efqd->elv_slice_idle)
+ enable_idle = 0;
+ else
+ enable_idle = 1;
+ }
+
+ /*
+ * From a think time perspective, idling should be enabled. Check with
+ * the io scheduler if it wants to disable idling based on additional
+ * considerations like seek pattern.
+ */
+ if (enable_idle) {
+ if (eq->ops->elevator_update_idle_window_fn)
+ enable_idle = eq->ops->elevator_update_idle_window_fn(
+ eq, ioq->sched_queue, rq);
+ if (!enable_idle)
+ elv_log_ioq(efqd, ioq, "iosched disabled idle");
+ }
+
+ if (old_idle != enable_idle) {
+ elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+ if (enable_idle)
+ elv_mark_ioq_idle_window(ioq);
+ else
+ elv_clear_ioq_idle_window(ioq);
+ }
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+ struct io_queue *ioq = NULL;
+
+ ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+ return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+ kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio,
+ int is_sync)
+{
+ struct elv_fq_data *efqd = &eq->efqd;
+ struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+ RB_CLEAR_NODE(&ioq->entity.rb_node);
+ atomic_set(&ioq->ref, 0);
+ ioq->efqd = efqd;
+ elv_ioq_set_ioprio_class(ioq, ioprio_class);
+ elv_ioq_set_ioprio(ioq, ioprio);
+ ioq->pid = current->pid;
+ ioq->sched_queue = sched_queue;
+ if (is_sync && !elv_ioq_class_idle(ioq))
+ elv_mark_ioq_idle_window(ioq);
+ bfq_init_entity(&ioq->entity, iog);
+ ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+ return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = ioq->efqd;
+ struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+ efqd);
+
+ BUG_ON(atomic_read(&ioq->ref) <= 0);
+ if (!atomic_dec_and_test(&ioq->ref))
+ return;
+ BUG_ON(ioq->nr_queued);
+ BUG_ON(ioq->entity.tree != NULL);
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(efqd->active_queue == ioq);
+
+ /* Can be called by outgoing elevator. Don't use q */
+ BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+ e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+ elv_log_ioq(efqd, ioq, "put_queue");
+ elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+ struct io_queue *ioq = *ioq_ptr;
+
+ if (ioq != NULL) {
+ /* Drop the reference taken by the io group */
+ elv_put_ioq(ioq);
+ *ioq_ptr = NULL;
+ }
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next,
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level; to begin with, the close
+ * cooperator feature is supported only for the root group so that default
+ * cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ struct io_sched_data *sd = &efqd->root_group->sched_data;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+ BUG_ON(!efqd->busy_queues);
+ BUG_ON(sd != entity->sched_data);
+ BUG_ON(!st);
+
+ bfq_update_vtime(st);
+ bfq_active_extract(st, entity);
+ sd->active_entity = entity;
+ entity->service = 0;
+ elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = NULL;
+ struct io_queue *ioq = NULL;
+ struct io_sched_data *sd;
+
+ /*
+ * One can check which queue will be selected next while a queue is
+ * still active. Preempt logic uses it.
+ */
+ BUG_ON(extract && efqd->active_queue != NULL);
+
+ if (!efqd->busy_queues)
+ return NULL;
+
+ sd = &efqd->root_group->sched_data;
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ BUG_ON(!entity);
+ if (extract)
+ entity->service = 0;
+ ioq = io_entity_to_ioq(entity);
+
+ return ioq;
+}
+
+/*
+ * coop indicates that the io scheduler selected a queue for us and we did not
+ * select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int coop)
+{
+ struct request_queue *q = efqd->queue;
+
+ if (ioq) {
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+ efqd->busy_queues);
+ ioq->slice_end = 0;
+
+ elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_mark_ioq_slice_new(ioq);
+
+ del_timer(&efqd->idle_slice_timer);
+ }
+
+ efqd->active_queue = ioq;
+
+ /* Let iosched know if it wants to take some action */
+ if (ioq) {
+ if (q->elevator->ops->elevator_active_ioq_set_fn)
+ q->elevator->ops->elevator_active_ioq_set_fn(q,
+ ioq->sched_queue, coop);
+ }
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ int coop = 0;
+
+ if (!ioq)
+ ioq = elv_get_next_ioq(q, 1);
+ else {
+ elv_set_next_ioq(q, ioq);
+ /*
+ * io scheduler selected the next queue for us. Pass this
+ * info back to the io scheduler. cfq currently uses it
+ * to reset coop flag on the queue.
+ */
+ coop = 1;
+ }
+ __elv_set_active_ioq(efqd, ioq, coop);
+ return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+ struct request_queue *q = efqd->queue;
+ struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+ if (q->elevator->ops->elevator_active_ioq_reset_fn)
+ q->elevator->ops->elevator_active_ioq_reset_fn(q,
+ ioq->sched_queue);
+ efqd->active_queue = NULL;
+ del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+ bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue)
+{
+ if (ioq == efqd->active_queue)
+ elv_reset_active_ioq(efqd);
+
+ bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(ioq == efqd->active_queue);
+ elv_log_ioq(efqd, ioq, "add to busy");
+ elv_activate_ioq(ioq, 0);
+ elv_mark_ioq_busy(ioq);
+ efqd->busy_queues++;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ BUG_ON(!elv_ioq_busy(ioq));
+ BUG_ON(ioq->nr_queued);
+ elv_log_ioq(efqd, ioq, "del from busy");
+ elv_clear_ioq_busy(ioq);
+ BUG_ON(efqd->busy_queues == 0);
+ efqd->busy_queues--;
+ if (elv_ioq_class_rt(ioq))
+ efqd->busy_rt_queues--;
+
+ elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command queuing
+ * and requests from multiple queues can be in flight at the same time, it
+ * is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and further)
+ * requests from the queue to finish. For seeky queues, we will expire the
+ * queue after dispatching a few requests without waiting, and start dispatching
+ * from the next queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+ assert_spin_locked(q->queue_lock);
+ elv_log_ioq(efqd, ioq, "slice expired");
+
+ if (elv_ioq_wait_request(ioq))
+ del_timer(&efqd->idle_slice_timer);
+
+ elv_clear_ioq_wait_request(ioq);
+
+ /*
+ * If ioq->slice_end == 0, it means the queue was expired before the first
+ * request from the queue completed. Of course we are not planning
+ * to idle on the queue otherwise we would not have expired it.
+ *
+ * Charge for the 25% slice in such cases. This is not the best thing
+ * to do but at the same time not very sure what's the next best
+ * thing to do.
+ *
+ * This arises from the fact that we don't have the notion of
+ * one queue being operational at one time. io scheduler can dispatch
+ * requests from multiple queues in one dispatch round. Ideally for
+ * more accurate accounting of exact disk time used by disk, one
+ * should dispatch requests from only one queue and wait for all
+ * the requests to finish. But this will reduce throughput.
+ */
+ if (!ioq->slice_end)
+ slice_used = entity->budget/4;
+ else {
+ if (time_after(ioq->slice_end, jiffies)) {
+ slice_unused = ioq->slice_end - jiffies;
+ if (slice_unused == entity->budget) {
+ /*
+ * queue got expired immediately after
+ * completing first request. Charge 25% of
+ * slice.
+ */
+ slice_used = entity->budget/4;
+ } else
+ slice_used = entity->budget - slice_unused;
+ } else {
+ slice_overshoot = jiffies - ioq->slice_end;
+ slice_used = entity->budget + slice_overshoot;
+ }
+ }
+
+ elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+ jiffies);
+ elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+ slice_used, entity->budget, slice_overshoot);
+ elv_ioq_served(ioq, slice_used);
+
+ BUG_ON(ioq != efqd->active_queue);
+ elv_reset_active_ioq(efqd);
+
+ if (!ioq->nr_queued)
+ elv_del_ioq_busy(q->elevator, ioq, 1);
+ else
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ * Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (ioq)
+ __elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+ struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elevator_queue *eq = q->elevator;
+
+ ioq = elv_active_ioq(eq);
+
+ if (!ioq)
+ return 0;
+
+ if (elv_ioq_slice_used(ioq))
+ return 1;
+
+ if (elv_ioq_class_idle(new_ioq))
+ return 0;
+
+ if (elv_ioq_class_idle(ioq))
+ return 1;
+
+ /*
+ * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ */
+ if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+ return 1;
+
+ /*
+ * Check with io scheduler if it has additional criterion based on
+ * which it wants to preempt existing queue.
+ */
+ if (eq->ops->elevator_should_preempt_fn)
+ return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+ return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+ elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+ elv_ioq_slice_expired(q);
+
+ /*
+ * Put the new queue at the front of the current list,
+ * so we know that it will be selected next.
+ */
+
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ BUG_ON(!efqd);
+ BUG_ON(!ioq);
+ efqd->rq_queued++;
+ ioq->nr_queued++;
+
+ if (!elv_ioq_busy(ioq))
+ elv_add_ioq_busy(efqd, ioq);
+
+ elv_ioq_update_io_thinktime(ioq);
+ elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+ if (ioq == elv_active_ioq(q->elevator)) {
+ /*
+ * Remember that we saw a request from this process, but
+ * don't start queuing just yet. Otherwise we risk seeing lots
+ * of tiny requests, because we disrupt the normal plugging
+ * and merging. If the request is already larger than a single
+ * page, let it rip immediately. For that case we assume that
+ * merging is already done. Ditto for a busy system that
+ * has other work pending, don't risk delaying until the
+ * idle timer unplug to continue working.
+ */
+ if (elv_ioq_wait_request(ioq)) {
+ if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+ efqd->busy_queues > 1) {
+ del_timer(&efqd->idle_slice_timer);
+ blk_start_queueing(q);
+ }
+ elv_mark_ioq_must_dispatch(ioq);
+ }
+ } else if (elv_should_preempt(q, ioq, rq)) {
+ /*
+ * not the active queue - expire current slice if it is
+ * idle and has expired its mean thinktime, or this new queue
+ * has some old slice time left and is of higher priority or
+ * this new queue is RT and the current one is BE
+ */
+ elv_preempt_queue(q, ioq);
+ blk_start_queueing(q);
+ }
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+ struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+ struct io_queue *ioq;
+ unsigned long flags;
+ struct request_queue *q = efqd->queue;
+
+ elv_log(efqd, "idle timer fired");
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ ioq = efqd->active_queue;
+
+ if (ioq) {
+
+ /*
+ * We saw a request before the queue expired, let it through
+ */
+ if (elv_ioq_must_dispatch(ioq))
+ goto out_kick;
+
+ /*
+ * expired
+ */
+ if (elv_ioq_slice_used(ioq))
+ goto expire;
+
+ /*
+ * only expire and reinvoke request handler, if there are
+ * other queues with pending requests
+ */
+ if (!elv_nr_busy_ioq(q->elevator))
+ goto out_cont;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (ioq->nr_queued)
+ goto out_kick;
+ }
+expire:
+ elv_ioq_slice_expired(q);
+out_kick:
+ elv_schedule_dispatch(q);
+out_cont:
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+ unsigned long sl;
+
+ BUG_ON(!ioq);
+
+ /*
+ * SSD device without seek penalty, disable idling. But only do so
+ * for devices that support queuing, otherwise we still have a problem
+ * with sync vs async workloads.
+ */
+ if (blk_queue_nonrot(q) && efqd->hw_tag)
+ return;
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver)
+ return;
+
+ /*
+ * idle is disabled, either manually or by past process history
+ */
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ return;
+
+ /*
+ * Maybe the iosched has got its own idling logic. In that case the io
+ * scheduler will take care of arming the timer, if need be.
+ */
+ if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ q->elevator->ops->elevator_arm_slice_timer_fn(q,
+ ioq->sched_queue);
+ } else {
+ elv_mark_ioq_wait_request(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu", sl);
+ }
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+ struct io_queue *ioq, *n;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+ elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+ if (!elv_nr_busy_ioq(q->elevator))
+ return NULL;
+
+ if (ioq == NULL)
+ goto new_queue;
+
+ /*
+ * Force dispatch. Continue to dispatch from current queue as long
+ * as it has requests.
+ */
+ if (unlikely(force)) {
+ if (ioq->nr_queued)
+ goto keep_queue;
+ else
+ goto expire;
+ }
+
+ /*
+ * The active queue has run out of time, expire it and select new.
+ */
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+ goto expire;
+
+ /*
+ * If we have an RT queue waiting, then we pre-empt the current non-RT
+ * queue.
+ */
+ if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+ /*
+ * We simulate this as the queue having timed out so that it gets to bank
+ * the remainder of its time slice.
+ */
+ elv_log_ioq(efqd, ioq, "preempt");
+ goto expire;
+ }
+
+ /*
+ * The active queue has requests and isn't expired, allow it to
+ * dispatch.
+ */
+
+ if (ioq->nr_queued)
+ goto keep_queue;
+
+ /*
+ * If another queue has a request waiting within our mean seek
+ * distance, let it run. The expire code will check for close
+ * cooperators and put the close queue at the front of the service
+ * tree.
+ */
+ new_ioq = elv_close_cooperator(q, ioq, 0);
+ if (new_ioq)
+ goto expire;
+
+ /*
+ * No requests pending. If the active queue still has requests in
+ * flight or is idling for a new request, allow either of these
+ * conditions to happen (or time out) before selecting a new queue.
+ */
+
+ if (timer_pending(&efqd->idle_slice_timer) ||
+ (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
+expire:
+ elv_ioq_slice_expired(q);
+new_queue:
+ ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+ return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq;
+ struct elv_fq_data *efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ ioq = rq->ioq;
+ BUG_ON(!ioq);
+ ioq->nr_queued--;
+
+ efqd = ioq->efqd;
+ BUG_ON(!efqd);
+ efqd->rq_queued--;
+
+ if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+ elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ BUG_ON(!ioq);
+ elv_ioq_request_dispatched(ioq);
+ elv_ioq_request_removed(e, rq);
+ elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ efqd->rq_in_driver++;
+ elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ WARN_ON(!efqd->rq_in_driver);
+ efqd->rq_in_driver--;
+ elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+ if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+ efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+ if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+ efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+ return;
+
+ if (efqd->hw_tag_samples++ < 50)
+ return;
+
+ if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+ efqd->hw_tag = 1;
+ else
+ efqd->hw_tag = 0;
+
+ efqd->hw_tag_samples = 0;
+ efqd->rq_in_driver_peak = 0;
+}
+
+/*
+ * If the io scheduler has the functionality of keeping track of close
+ * cooperators, check with it whether it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq, int probe)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *new_ioq = NULL;
+
+ /*
+ * Currently this feature is supported only for flat hierarchy or
+ * root group queues so that default cfq behavior is not changed.
+ */
+ if (!is_root_group_ioq(q, ioq))
+ return NULL;
+
+ if (q->elevator->ops->elevator_close_cooperator_fn)
+ new_ioq = e->ops->elevator_close_cooperator_fn(q,
+ ioq->sched_queue, probe);
+
+ /* Only select co-operating queue if it belongs to root group */
+ if (new_ioq && !is_root_group_ioq(q, new_ioq))
+ return NULL;
+
+ return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+ const int sync = rq_is_sync(rq);
+ struct io_queue *ioq = rq->ioq;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ elv_log_ioq(efqd, ioq, "complete");
+
+ elv_update_hw_tag(efqd);
+
+ WARN_ON(!efqd->rq_in_driver);
+ WARN_ON(!ioq->dispatched);
+ efqd->rq_in_driver--;
+ ioq->dispatched--;
+
+ if (sync)
+ ioq->last_end_request = jiffies;
+
+ /*
+ * If this is the active queue, check if it needs to be expired,
+ * or if we want to idle in case it has no pending requests.
+ */
+
+ if (elv_active_ioq(q->elevator) == ioq) {
+ if (elv_ioq_slice_new(ioq)) {
+ elv_ioq_set_prio_slice(q, ioq);
+ elv_clear_ioq_slice_new(ioq);
+ }
+ /*
+ * If there are no requests waiting in this queue, and
+ * there are other queues ready to issue requests, AND
+ * those other queues are issuing requests within our
+ * mean seek distance, give them a chance to run instead
+ * of idling.
+ */
+ if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ elv_ioq_slice_expired(q);
+ else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ && sync && !rq_noidle(rq))
+ elv_ioq_arm_slice_timer(q);
+ }
+
+ if (!efqd->rq_in_driver)
+ elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio)
+{
+ struct io_queue *ioq = NULL;
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ ioq = iog->async_queue[0][ioprio];
+ break;
+ case IOPRIO_CLASS_BE:
+ ioq = iog->async_queue[1][ioprio];
+ break;
+ case IOPRIO_CLASS_IDLE:
+ ioq = iog->async_idle_queue;
+ break;
+ default:
+ BUG();
+ }
+
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq)
+{
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ iog->async_queue[0][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_BE:
+ iog->async_queue[1][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ iog->async_idle_queue = ioq;
+ break;
+ default:
+ BUG();
+ }
+
+ /*
+ * Take the group reference and pin the queue. Group exit will
+ * clean it up
+ */
+ elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+ /*
+ * Caller already ensured that pending RCU callbacks are completed,
+ * so we should have no busy allocations at this point.
+ */
+ if (elv_ioq_pool)
+ kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+ elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+ if (!elv_ioq_pool)
+ goto fail;
+
+ return 0;
+fail:
+ elv_slab_kill();
+ return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ iog = io_alloc_root_group(q, e, efqd);
+ if (iog == NULL)
+ return 1;
+
+ efqd->root_group = iog;
+ efqd->queue = q;
+
+ init_timer(&efqd->idle_slice_timer);
+ efqd->idle_slice_timer.function = elv_idle_slice_timer;
+ efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+ INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+ INIT_LIST_HEAD(&efqd->idle_list);
+
+ efqd->elv_slice[0] = elv_slice_async;
+ efqd->elv_slice[1] = elv_slice_sync;
+ efqd->elv_slice_idle = elv_slice_idle;
+ efqd->hw_tag = 1;
+
+ return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+ struct request_queue *q = efqd->queue;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+
+ spin_lock_irq(q->queue_lock);
+ /* This should drop all the idle tree references of ioq */
+ elv_free_idle_ioq_list(e);
+ spin_unlock_irq(q->queue_lock);
+
+ elv_shutdown_timer_wq(e);
+
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+ io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. To me, the io scheduler has already been drained out, and all
+ * the active queues have already been expired, so the timer and work queue
+ * should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = &e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+ if (elv_slab_setup())
+ return -ENOMEM;
+
+ /* could be 0 on HZ < 1000 setups */
+
+ if (!elv_slice_async)
+ elv_slice_async = 1;
+
+ if (!elv_slice_idle)
+ elv_slice_idle = 1;
+
+ return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES 3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree. All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+ struct rb_root active;
+ struct rb_root idle;
+
+ struct io_entity *first_idle;
+ struct io_entity *last_idle;
+
+ bfq_timestamp_t vtime;
+ bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue. It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_sched_data {
+ struct io_entity *active_entity;
+ struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ * the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ * this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ * associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ * ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ * ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy. Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now. Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag. As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ. When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_entity {
+ struct rb_node rb_node;
+
+ int on_st;
+
+ bfq_timestamp_t finish;
+ bfq_timestamp_t start;
+
+ struct rb_root *tree;
+
+ bfq_timestamp_t min_start;
+
+ bfq_service_t service, budget;
+ bfq_weight_t weight;
+
+ struct io_entity *parent;
+
+ struct io_sched_data *my_sched_data;
+ struct io_sched_data *sched_data;
+
+ unsigned short ioprio, new_ioprio;
+ unsigned short ioprio_class, new_ioprio_class;
+
+ int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+ struct io_entity entity;
+ atomic_t ref;
+ unsigned int flags;
+
+ /* Pointer to generic elevator data structure */
+ struct elv_fq_data *efqd;
+ struct list_head queue_list;
+ pid_t pid;
+
+ /* Number of requests queued on this io queue */
+ unsigned long nr_queued;
+
+ /* Requests dispatched from this queue */
+ int dispatched;
+
+ /* Keep track of the think time of processes in this queue */
+ unsigned long last_end_request;
+ unsigned long ttime_total;
+ unsigned long ttime_samples;
+ unsigned long ttime_mean;
+
+ unsigned long slice_end;
+
+ /* Pointer to io scheduler's queue */
+ void *sched_queue;
+};
+
+struct io_group {
+ struct io_sched_data sched_data;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+ struct io_group *root_group;
+
+ /* List of io queues on idle tree. */
+ struct list_head idle_list;
+
+ struct request_queue *queue;
+ unsigned int busy_queues;
+ /*
+ * Used to track any pending RT requests so we can pre-empt the current
+ * non-RT queue in service when this value is non-zero.
+ */
+ unsigned int busy_rt_queues;
+
+ /* Number of requests queued */
+ int rq_queued;
+
+ /* Pointer to the ioscheduler queue being served */
+ void *active_queue;
+
+ int rq_in_driver;
+ int hw_tag;
+ int hw_tag_samples;
+ int rq_in_driver_peak;
+
+ /*
+ * The elevator fair queuing layer has the capability to provide idling
+ * to ensure fairness for processes doing dependent reads.
+ * This might be needed to ensure fairness between two processes doing
+ * synchronous reads in two different cgroups. noop and deadline don't
+ * have any notion of anticipation/idling of their own; as of now, they
+ * are the users of this functionality.
+ */
+ unsigned int elv_slice_idle;
+ struct timer_list idle_slice_timer;
+ struct work_struct unplug_work;
+
+ unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid, \
+ elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples) ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+ ELV_QUEUE_FLAG_busy = 0, /* has requests or is under service */
+ ELV_QUEUE_FLAG_sync, /* synchronous queue */
+ ELV_QUEUE_FLAG_idle_window, /* elevator slice idling enabled */
+ ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
+ ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
+ ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name) \
+static inline void elv_mark_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline int elv_ioq_##name(struct io_queue *ioq) \
+{ \
+ return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0; \
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+ struct io_sched_data *sched_data = entity->sched_data;
+ unsigned int idx = entity->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+ BUG_ON(sched_data == NULL);
+
+ return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+ ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+ if (elv_ioq_slice_new(ioq))
+ return 0;
+ if (time_before(jiffies, ioq->slice_end))
+ return 0;
+
+ return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+ return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+ return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+ return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+ return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+ return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+ atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+ unsigned long slice_end)
+{
+ ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+ return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+ int ioprio_class)
+{
+ ioq->entity.new_ioprio_class = ioprio_class;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+ ioq->entity.new_ioprio = ioprio;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+ return container_of(ioq->entity.sched_data, struct io_group,
+ sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+ size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+ struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+ unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+ struct elevator_queue *e)
+{
+ return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+ return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
for (i = 0; i < ELV_HASH_ENTRIES; i++)
INIT_HLIST_HEAD(&eq->hash[i]);
+ if (elv_init_fq_data(q, eq))
+ goto err;
+
return eq;
err:
kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
e->ops = NULL;
+ elv_exit_fq_data_post(e);
mutex_unlock(&e->sysfs_lock);
kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_activate_rq(q, rq);
+
if (e->ops->elevator_activate_req_fn)
e->ops->elevator_activate_req_fn(q, rq);
}
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_deactivate_rq(q, rq);
+
if (e->ops->elevator_deactivate_req_fn)
e->ops->elevator_deactivate_req_fn(q, rq);
}
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
boundary = q->end_sector;
stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);
q->nr_sorted--;
+ elv_fq_dispatched_request(q->elevator, rq);
q->end_sector = rq_end_sector(rq);
q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
elv_rqhash_del(q, next);
q->nr_sorted--;
+ elv_ioq_request_removed(e, next);
q->last_merge = rq;
}
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->last_merge = rq;
}
- /*
- * Some ioscheds (cfq) run q->request_fn directly, so
- * rq cannot be accessed after calling
- * elevator_add_req_fn.
- */
q->elevator->ops->elevator_add_req_fn(q, rq);
+ elv_ioq_request_add(q, rq);
break;
case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
int elv_queue_empty(struct request_queue *q)
{
- struct elevator_queue *e = q->elevator;
-
if (!list_empty(&q->queue_head))
return 0;
- if (e->ops->elevator_queue_empty_fn)
- return e->ops->elevator_queue_empty_fn(q);
+ /* Hopefully nr_sorted works and no need to call queue_empty_fn */
+ if (q->nr_sorted)
+ return 0;
return 1;
}
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
*/
if (blk_account_rq(rq)) {
q->in_flight--;
- if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
- e->ops->elevator_completed_req_fn(q, rq);
+ if (blk_sorted_rq(rq)) {
+ if (e->ops->elevator_completed_req_fn)
+ e->ops->elevator_completed_req_fn(q, rq);
+ elv_ioq_completed_request(q, rq);
+ }
}
/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
return NULL;
}
EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+ return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+ return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
/* for bidi */
struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* io queue request belongs to */
+ struct io_queue *ioq;
+#endif
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
#define _LINUX_ELEVATOR_H
#include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
#ifdef CONFIG_BLOCK
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+ struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+ struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+ void*, int probe);
+#endif
struct elevator_ops
{
@@ -56,6 +69,17 @@ struct elevator_ops
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+ elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+ elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+ elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+ elevator_should_preempt_fn *elevator_should_preempt_fn;
+ elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+ elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
};
#define ELV_NAME_MAX (16)
@@ -76,6 +100,9 @@ struct elevator_type
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ int elevator_features;
+#endif
};
/*
@@ -89,6 +116,10 @@ struct elevator_queue
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* fair queuing data */
+ struct elv_fq_data efqd;
+#endif
};
/*
@@ -209,5 +240,25 @@ enum {
__val; \
})
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define ELV_IOSCHED_NEED_FQ 1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer
2009-05-05 19:58 ` Vivek Goyal
(?)
@ 2009-05-22 6:43 ` Gui Jianfeng
2009-05-22 12:32 ` Vivek Goyal
[not found] ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
-1 siblings, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22 6:43 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
> +/* A request got completed from io_queue. Do the accounting. */
> +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> +{
> + const int sync = rq_is_sync(rq);
> + struct io_queue *ioq = rq->ioq;
> + struct elv_fq_data *efqd = &q->elevator->efqd;
> +
> + if (!elv_iosched_fair_queuing_enabled(q->elevator))
> + return;
> +
> + elv_log_ioq(efqd, ioq, "complete");
> +
> + elv_update_hw_tag(efqd);
> +
> + WARN_ON(!efqd->rq_in_driver);
> + WARN_ON(!ioq->dispatched);
> + efqd->rq_in_driver--;
> + ioq->dispatched--;
> +
> + if (sync)
> + ioq->last_end_request = jiffies;
> +
> + /*
> + * If this is the active queue, check if it needs to be expired,
> + * or if we want to idle in case it has no pending requests.
> + */
> +
> + if (elv_active_ioq(q->elevator) == ioq) {
> + if (elv_ioq_slice_new(ioq)) {
> + elv_ioq_set_prio_slice(q, ioq);
Hi Vivek,
Would you explain a bit why slice_end should be set when first request completes.
Why not set it just when an ioq gets active?
Thanks.
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer
2009-05-22 6:43 ` Gui Jianfeng
@ 2009-05-22 12:32 ` Vivek Goyal
2009-05-23 20:04 ` Jens Axboe
[not found] ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
[not found] ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
1 sibling, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:32 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +/* A request got completed from io_queue. Do the accounting. */
> > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > +{
> > + const int sync = rq_is_sync(rq);
> > + struct io_queue *ioq = rq->ioq;
> > + struct elv_fq_data *efqd = &q->elevator->efqd;
> > +
> > + if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > + return;
> > +
> > + elv_log_ioq(efqd, ioq, "complete");
> > +
> > + elv_update_hw_tag(efqd);
> > +
> > + WARN_ON(!efqd->rq_in_driver);
> > + WARN_ON(!ioq->dispatched);
> > + efqd->rq_in_driver--;
> > + ioq->dispatched--;
> > +
> > + if (sync)
> > + ioq->last_end_request = jiffies;
> > +
> > + /*
> > + * If this is the active queue, check if it needs to be expired,
> > + * or if we want to idle in case it has no pending requests.
> > + */
> > +
> > + if (elv_active_ioq(q->elevator) == ioq) {
> > + if (elv_ioq_slice_new(ioq)) {
> > + elv_ioq_set_prio_slice(q, ioq);
>
> Hi Vivek,
>
> Would you explain a bit why slice_end should be set when first request completes.
> Why not set it just when an ioq gets active?
>
Hi Gui,
I have kept the behavior same as CFQ. I guess reason behind this is that
when a new queue is scheduled in, first request completion might take more
time as head of the disk might be quite a distance away (due to previous
queue) and one probably does not want to charge the new queue for that
first seek time. That's the reason we start the queue slice when first
request has completed.
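As a purely hypothetical illustration: if the previous queue left the disk head
far away and the new queue's first read therefore needs, say, an 8ms seek before
it completes, starting the slice only at that first completion means those 8ms
of seek time are not charged to the new queue.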
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer
2009-05-22 12:32 ` Vivek Goyal
@ 2009-05-23 20:04 ` Jens Axboe
[not found] ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Jens Axboe @ 2009-05-23 20:04 UTC (permalink / raw)
To: Vivek Goyal
Cc: Gui Jianfeng, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Fri, May 22 2009, Vivek Goyal wrote:
> On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > ...
> > > +/* A request got completed from io_queue. Do the accounting. */
> > > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > > +{
> > > + const int sync = rq_is_sync(rq);
> > > + struct io_queue *ioq = rq->ioq;
> > > + struct elv_fq_data *efqd = &q->elevator->efqd;
> > > +
> > > + if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > > + return;
> > > +
> > > + elv_log_ioq(efqd, ioq, "complete");
> > > +
> > > + elv_update_hw_tag(efqd);
> > > +
> > > + WARN_ON(!efqd->rq_in_driver);
> > > + WARN_ON(!ioq->dispatched);
> > > + efqd->rq_in_driver--;
> > > + ioq->dispatched--;
> > > +
> > > + if (sync)
> > > + ioq->last_end_request = jiffies;
> > > +
> > > + /*
> > > + * If this is the active queue, check if it needs to be expired,
> > > + * or if we want to idle in case it has no pending requests.
> > > + */
> > > +
> > > + if (elv_active_ioq(q->elevator) == ioq) {
> > > + if (elv_ioq_slice_new(ioq)) {
> > > + elv_ioq_set_prio_slice(q, ioq);
> >
> > Hi Vivek,
> >
> > Would you explain a bit why slice_end should be set when first request completes.
> > Why not set it just when an ioq gets active?
> >
>
> Hi Gui,
>
> I have kept the behavior same as CFQ. I guess reason behind this is that
> when a new queue is scheduled in, first request completion might take more
> time as head of the disk might be quite a distance away (due to previous
> queue) and one probably does not want to charge the new queue for that
> first seek time. That's the reason we start the queue slice when first
> request has completed.
That's exactly why CFQ does it that way. And not just for the seek
itself, but if have eg writes issued before the switch to a new queue,
it's not fair to charge the potential cache writeout happening ahead of
the read to that new queue. So I'd definitely recommend keeping this
behaviour, as you have.
--
Jens Axboe
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH 03/18] io-controller: Charge for time slice based on average disk rate
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
` (18 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o There are situations where a queue gets expired very soon and it looks
as if the time slice used by that queue is zero. For example, if an async
queue dispatches a bunch of requests and the queue is expired before the first
request completes. Another example is where a queue is expired as soon
as the first request completes and the queue has no more requests (sync queues
on SSD).
o Currently we just charge 25% of the slice length in such cases. This patch tries
to improve on that approximation by keeping track of the average disk rate
and charging for time as nr_sectors/disk_rate (a worked illustration follows below).
o This is still experimental, not very sure if it gives measurable improvement
or not.
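o As a hypothetical illustration of the intended accounting (the numbers are
made up, not measured): if a sampling window sees roughly 1024 sectors complete
over 10 jiffies, mean_rate settles at about 102 sectors per jiffy (1024/10), so
an async queue that dispatched 200 sectors before being expired is charged
200/102, i.e. about 1 jiffy of disk time, instead of a flat 25% of its slice
length.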
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/elevator-fq.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 11 ++++++
2 files changed, 94 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9aea899..9f1fbb9 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -19,6 +19,9 @@ const int elv_slice_async_rq = 2;
int elv_slice_idle = HZ / 125;
static struct kmem_cache *elv_ioq_pool;
+/* Maximum Window length for updating average disk rate */
+static int elv_rate_sampling_window = HZ / 10;
+
#define ELV_SLICE_SCALE (5)
#define ELV_HW_QUEUE_MIN (5)
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
@@ -1022,6 +1025,47 @@ static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
}
+static void elv_update_io_rate(struct elv_fq_data *efqd, struct request *rq)
+{
+ long elapsed = jiffies - efqd->rate_sampling_start;
+ unsigned long total;
+
+ /* sampling window is off */
+ if (!efqd->rate_sampling_start)
+ return;
+
+ efqd->rate_sectors_current += rq->nr_sectors;
+
+ if (efqd->rq_in_driver && (elapsed < elv_rate_sampling_window))
+ return;
+
+ efqd->rate_sectors = (7*efqd->rate_sectors +
+ 256*efqd->rate_sectors_current) / 8;
+
+ if (!elapsed) {
+ /*
+ * updating rate before a jiffy could complete. Could be a
+ * problem with fast queuing/non-queuing hardware. Should we
+ * look at higher resolution time source?
+ *
+ * In case of non-queuing hardware we will probably not try to
+ * dispatch from multiple queues and will be able to account
+ * for disk time used and will not need this approximation
+ * anyway?
+ */
+ elapsed = 1;
+ }
+
+ efqd->rate_time = (7*efqd->rate_time + 256*elapsed) / 8;
+ total = efqd->rate_sectors + (efqd->rate_time/2);
+ efqd->mean_rate = total/efqd->rate_time;
+
+ elv_log(efqd, "mean_rate=%d, t=%d s=%d", efqd->mean_rate,
+ elapsed, efqd->rate_sectors_current);
+ efqd->rate_sampling_start = 0;
+ efqd->rate_sectors_current = 0;
+}
+
/*
* Disable idle window if the process thinks too long.
* This idle flag can also be updated by io scheduler.
@@ -1312,6 +1356,34 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
}
/*
+ * Calculate the effective disk time used by the queue based on how many
+ * sectors the queue has dispatched and what the average disk rate is.
+ * Returns disk time in jiffies.
+ */
+static inline unsigned long elv_disk_time_used(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+ struct io_entity *entity = &ioq->entity;
+ unsigned long jiffies_used = 0;
+
+ if (!efqd->mean_rate)
+ return entity->budget/4;
+
+ /* Charge the queue based on average disk rate */
+ jiffies_used = ioq->nr_sectors/efqd->mean_rate;
+
+ if (!jiffies_used)
+ jiffies_used = 1;
+
+ elv_log_ioq(efqd, ioq, "disk time=%ldms sect=%ld rate=%ld",
+ jiffies_to_msecs(jiffies_used),
+ ioq->nr_sectors, efqd->mean_rate);
+
+ return jiffies_used;
+}
+
+/*
* Do the accounting. Determine how much service (in terms of time slices)
* current queue used and adjust the start, finish time of queue and vtime
* of the tree accordingly.
@@ -1363,7 +1435,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
* the requests to finish. But this will reduce throughput.
*/
if (!ioq->slice_end)
- slice_used = entity->budget/4;
+ slice_used = elv_disk_time_used(q, ioq);
else {
if (time_after(ioq->slice_end, jiffies)) {
slice_unused = ioq->slice_end - jiffies;
@@ -1373,7 +1445,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
* completing first request. Charge 25% of
* slice.
*/
- slice_used = entity->budget/4;
+ slice_used = elv_disk_time_used(q, ioq);
} else
slice_used = entity->budget - slice_unused;
} else {
@@ -1391,6 +1463,8 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
BUG_ON(ioq != efqd->active_queue);
elv_reset_active_ioq(efqd);
+ /* Queue is being expired. Reset number of sectors dispatched */
+ ioq->nr_sectors = 0;
if (!ioq->nr_queued)
elv_del_ioq_busy(q->elevator, ioq, 1);
else
@@ -1725,6 +1799,7 @@ void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
BUG_ON(!ioq);
elv_ioq_request_dispatched(ioq);
+ ioq->nr_sectors += rq->nr_sectors;
elv_ioq_request_removed(e, rq);
elv_clear_ioq_must_dispatch(ioq);
}
@@ -1737,6 +1812,10 @@ void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
return;
efqd->rq_in_driver++;
+
+ if (!efqd->rate_sampling_start)
+ efqd->rate_sampling_start = jiffies;
+
elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
efqd->rq_in_driver);
}
@@ -1826,6 +1905,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
efqd->rq_in_driver--;
ioq->dispatched--;
+ elv_update_io_rate(efqd, rq);
+
if (sync)
ioq->last_end_request = jiffies;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 3bea279..ce2d671 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,6 +165,9 @@ struct io_queue {
/* Requests dispatched from this queue */
int dispatched;
+ /* Number of sectors dispatched in current dispatch round */
+ int nr_sectors;
+
/* Keep a track of think time of processes in this queue */
unsigned long last_end_request;
unsigned long ttime_total;
@@ -223,6 +226,14 @@ struct elv_fq_data {
struct work_struct unplug_work;
unsigned int elv_slice[2];
+
+ /* Fields for keeping track of average disk rate */
+ unsigned long rate_sectors; /* number of sectors finished */
+ unsigned long rate_time; /* jiffies elapsed */
+ unsigned long mean_rate; /* sectors per jiffy */
+ unsigned long long rate_sampling_start; /* sampling window start jiffies */
+ /* number of sectors of IO completed during the current sampling window */
+ unsigned long rate_sectors_current;
};
extern int elv_slice_idle;
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
` (17 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This patch changes cfq to use the fair queuing code from the elevator layer.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 3 +-
block/cfq-iosched.c | 1097 ++++++++++---------------------------------------
2 files changed, 219 insertions(+), 881 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
menu "IO Schedulers"
config ELV_FAIR_QUEUING
- bool "Elevator Fair Queuing Support"
+ bool
default n
---help---
Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select ELV_FAIR_QUEUING
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
-
/*
* tunables
*/
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
static const int cfq_back_max = 16 * 1024;
/* penalty of a backwards seek */
static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY (HZ / 5)
/*
* below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq) (struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
static struct kmem_cache *cfq_pool;
static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);
#define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
#define sample_valid(samples) ((samples) > 80)
@@ -75,12 +64,6 @@ struct cfq_rb_root {
*/
struct cfq_data {
struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_rb_root service_tree;
-
/*
* Each priority tree is sorted by next_request position. These
* trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
*/
struct rb_root prio_trees[CFQ_PRIO_LISTS];
- unsigned int busy_queues;
- /*
- * Used to track any pending rt requests so we can pre-empt current
- * non-RT cfqq in service when this value is non-zero.
- */
- unsigned int busy_rt_queues;
-
- int rq_in_driver;
int sync_flight;
- /*
- * queue-depth detection
- */
- int rq_queued;
- int hw_tag;
- int hw_tag_samples;
- int rq_in_driver_peak;
-
- /*
- * idle window management
- */
- struct timer_list idle_slice_timer;
- struct work_struct unplug_work;
-
- struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;
- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;
unsigned long last_end_request;
@@ -131,9 +85,7 @@ struct cfq_data {
unsigned int cfq_fifo_expire[2];
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
- unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
- unsigned int cfq_slice_idle;
struct list_head cic_list;
};
@@ -142,16 +94,11 @@ struct cfq_data {
* Per process-grouping structure
*/
struct cfq_queue {
- /* reference count */
- atomic_t ref;
+ struct io_queue *ioq;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;
- unsigned long slice_end;
- long slice_resid;
unsigned int slice_dispatch;
/* pending metadata requests */
int meta_pending;
- /* number of requests that are on the dispatch list or inside driver */
- int dispatched;
/* io prio of this group */
- unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
+ unsigned short org_ioprio;
+ unsigned short org_ioprio_class;
pid_t pid;
};
enum cfqq_state_flags {
- CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
- CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
- CFQ_CFQQ_FLAG_must_dispatch, /* must be allowed a dispatch */
CFQ_CFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
CFQ_CFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
- CFQ_CFQQ_FLAG_idle_window, /* slice idling enabled */
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
- CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
- CFQ_CFQQ_FLAG_sync, /* synchronous queue */
CFQ_CFQQ_FLAG_coop, /* has done a coop jump of the queue */
};
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq) \
return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0; \
}
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
CFQ_CFQQ_FNS(must_alloc);
CFQ_CFQQ_FNS(must_alloc_slice);
CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
return 0;
}
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
{
- if (cfqd->busy_queues) {
- cfq_log(cfqd, "schedule dispatch");
- kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
- }
+ return ioq_to_io_group(cfqq->ioq);
}
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- return !cfqd->busy_queues;
+ return elv_ioq_class_idle(cfqq->ioq);
}
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
- unsigned short prio)
-{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
{
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+ return elv_ioq_class_rt(cfqq->ioq);
}
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
- cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+ return elv_ioq_sync(cfqq->ioq);
}
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
+ struct cfq_data *cfqd = cfqq->cfqd;
+ struct elevator_queue *e = cfqd->queue->elevator;
- return 1;
+ return (elv_active_sched_queue(e) == cfqq);
}
/*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
* would be nice to take fifo expire time into account as well
*/
static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
BUG_ON(RB_EMPTY_NODE(&last->rb_node));
- if (rbprev)
+ if (rbprev != NULL)
prev = rb_entry_rq(rbprev);
- if (rbnext)
+ if (rbnext != NULL)
next = rb_entry_rq(rbnext);
else {
rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
}
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
-{
- struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
-
- if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
- } else
- rb_key += jiffies;
- } else if (!add_front) {
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key += cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else
- rb_key = 0;
-
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- }
-
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (rb_key < __cfqq->rb_key)
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
-
- if (n == &(*p)->rb_right)
- left = 0;
-
- p = n;
- }
-
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
-
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
static struct cfq_queue *
cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
{
- /*
- * Resorting requires the cfqq to be on the RR list already.
- */
- if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
- cfq_prio_tree_add(cfqd, cfqq);
- }
-}
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- cfq_mark_cfqq_on_rr(cfqq);
- cfqd->busy_queues++;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues++;
+ if (cfqd->active_cic) {
+ put_io_context(cfqd->active_cic->ioc);
+ cfqd->active_cic = NULL;
+ }
- cfq_resort_rr_list(cfqd, cfqq);
+ /* Resort the cfqq in prio tree */
+ if (cfqq)
+ cfq_prio_tree_add(cfqd, cfqq);
}
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
{
- cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
- cfq_clear_cfqq_on_rr(cfqq);
+ struct cfq_queue *cfqq = sched_queue;
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- if (cfqq->p_root) {
- rb_erase(&cfqq->p_node, cfqq->p_root);
- cfqq->p_root = NULL;
- }
+ cfqq->slice_dispatch = 0;
- BUG_ON(!cfqd->busy_queues);
- cfqd->busy_queues--;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues--;
+ cfq_clear_cfqq_must_alloc_slice(cfqq);
+ cfq_clear_cfqq_fifo_expire(cfqq);
+ if (!coop)
+ cfq_clear_cfqq_coop(cfqq);
}
/*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);
BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
elv_rb_del(&cfqq->sort_list, rq);
- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ /*
+ * If this was the last request in the queue, remove this queue from
+ * the prio trees. For the last request the nr_queued count will still
+ * be 1, as the elevator fair queuing layer is yet to do the accounting.
+ */
+ if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}
static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
cfq_dispatch_insert(cfqd->queue, __alias);
- if (!cfq_cfqq_on_rr(cfqq))
- cfq_add_cfqq_rr(cfqd, cfqq);
-
/*
* check if this request is a better next-serve candidate
*/
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
- cfqd->rq_in_driver++;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
- cfqd->rq_in_driver);
-
cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
}
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- WARN_ON(!cfqd->rq_in_driver);
- cfqd->rq_in_driver--;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
- cfqd->rq_in_driver);
-}
-
static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);
- cfqq->cfqd->rq_queued--;
if (rq_is_meta(rq)) {
WARN_ON(!cfqq->meta_pending);
cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
return 0;
}
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (cfqq) {
- cfq_log_cfqq(cfqd, cfqq, "set_active");
- cfqq->slice_end = 0;
- cfqq->slice_dispatch = 0;
-
- cfq_clear_cfqq_wait_request(cfqq);
- cfq_clear_cfqq_must_dispatch(cfqq);
- cfq_clear_cfqq_must_alloc_slice(cfqq);
- cfq_clear_cfqq_fifo_expire(cfqq);
- cfq_mark_cfqq_slice_new(cfqq);
-
- del_timer(&cfqd->idle_slice_timer);
- }
-
- cfqd->active_queue = cfqq;
-}
-
/*
* current cfqq expired its slice (or was too idle), select new one
*/
static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
- if (cfq_cfqq_wait_request(cfqq))
- del_timer(&cfqd->idle_slice_timer);
-
- cfq_clear_cfqq_wait_request(cfqq);
-
- /*
- * store what was left of this slice, if the queue idled/timed out
- */
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
-
- cfq_resort_rr_list(cfqd, cfqq);
-
- if (cfqq == cfqd->active_queue)
- cfqd->active_queue = NULL;
-
- if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->ioc);
- cfqd->active_cic = NULL;
- }
+ __elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
}
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
-
- return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (!cfqq) {
- cfqq = cfq_get_next_queue(cfqd);
- if (cfqq)
- cfq_clear_cfqq_coop(cfqq);
- }
-
- __cfq_set_active_queue(cfqd, cfqq);
- return cfqq;
+ __cfq_slice_expired(cfqd, cfqq);
}
static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* associated with the I/O issued by cur_cfqq. I'm not sure this is a valid
* assumption.
*/
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
- struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+ void *cur_sched_queue,
int probe)
{
- struct cfq_queue *cfqq;
+ struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
/*
* A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!probe)
cfq_mark_cfqq_coop(cfqq);
- return cfqq;
+ return cfqq->ioq;
}
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_io_context *cic;
unsigned long sl;
- /*
- * SSD device without seek penalty, disable idling. But only do so
- * for devices that support queuing, otherwise we still have a problem
- * with sync vs async workloads.
- */
- if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
-
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
- WARN_ON(cfq_cfqq_slice_new(cfqq));
-
- /*
- * idle is disabled, either manually or by past process history
- */
- if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
- return;
-
- /*
- * still requests with the driver, don't idle
- */
- if (cfqd->rq_in_driver)
- return;
-
+ WARN_ON(elv_ioq_slice_new(cfqq->ioq));
/*
* task has exited, don't wait
*/
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
return;
- cfq_mark_cfqq_wait_request(cfqq);
+ elv_mark_ioq_wait_request(cfqq->ioq);
/*
* we don't want to idle for seeks, but we do want to allow
* fair distribution of slice time for a process doing back-to-back
* seeks. so allow a little bit of time for him to submit a new rq
*/
- sl = cfqd->cfq_slice_idle;
+ sl = elv_get_slice_idle(q->elevator);
if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
- mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+ elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
}
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
*/
static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);
+ struct cfq_data *cfqd = q->elevator->elevator_data;
- cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+ cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
cfq_remove_request(rq);
- cfqq->dispatched++;
elv_dispatch_sort(q, rq);
if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
const int base_rq = cfqd->cfq_slice_async_rq;
+ unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
- WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
- return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
- struct cfq_queue *cfqq, *new_cfqq = NULL;
-
- cfqq = cfqd->active_queue;
- if (!cfqq)
- goto new_queue;
-
- /*
- * The active queue has run out of time, expire it and select new.
- */
- if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
- goto expire;
-
- /*
- * If we have a RT cfqq waiting, then we pre-empt the current non-rt
- * cfqq.
- */
- if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
- /*
- * We simulate this as cfqq timed out so that it gets to bank
- * the remaining of its time slice.
- */
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
- goto new_queue;
- }
-
- /*
- * The active queue has requests and isn't expired, allow it to
- * dispatch.
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto keep_queue;
-
- /*
- * If another queue has a request waiting within our mean seek
- * distance, let it run. The expire code will check for close
- * cooperators and put the close queue at the front of the service
- * tree.
- */
- new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
- if (new_cfqq)
- goto expire;
+ WARN_ON(ioprio >= IOPRIO_BE_NR);
- /*
- * No requests pending. If the active queue still has requests in
- * flight or is idling for a new request, allow either of these
- * conditions to happen (or time out) before selecting a new queue.
- */
- if (timer_pending(&cfqd->idle_slice_timer) ||
- (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
- cfqq = NULL;
- goto keep_queue;
- }
-
-expire:
- cfq_slice_expired(cfqd, 0);
-new_queue:
- cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
- return cfqq;
+ return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
}
static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
- cfq_slice_expired(cfqd, 0);
+ /* This probably is redundant now. The above loop should make sure
+ * that all the busy queues have expired */
+ cfq_slice_expired(cfqd);
- BUG_ON(cfqd->busy_queues);
+ BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
struct cfq_queue *cfqq;
unsigned int max_dispatch;
- if (!cfqd->busy_queues)
- return 0;
-
if (unlikely(force))
return cfq_forced_dispatch(cfqd);
- cfqq = cfq_select_queue(cfqd);
+ cfqq = elv_select_sched_queue(q, 0);
if (!cfqq)
return 0;
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* Does this cfqq already have too much IO in flight?
*/
- if (cfqq->dispatched >= max_dispatch) {
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
/*
* idle queue must always only have a single IO in flight
*/
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* We have other queues, don't allow more IO from this one
*/
- if (cfqd->busy_queues > 1)
+ if (elv_nr_busy_ioq(q->elevator) > 1)
return 0;
/*
* we are the only queue, allow up to 4 times of 'quantum'
*/
- if (cfqq->dispatched >= 4 * max_dispatch)
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
return 0;
}
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
*/
cfq_dispatch_request(cfqd, cfqq);
cfqq->slice_dispatch++;
- cfq_clear_cfqq_must_dispatch(cfqq);
/*
* expire an async queue immediately if it has used up its slice. idle
* queue always expire after 1 dispatch round.
*/
- if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+ if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
}
cfq_log(cfqd, "dispatched a request");
return 1;
}
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
{
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_data *cfqd = cfqq->cfqd;
- BUG_ON(atomic_read(&cfqq->ref) <= 0);
+ BUG_ON(!cfqq);
- if (!atomic_dec_and_test(&cfqq->ref))
- return;
-
- cfq_log_cfqq(cfqd, cfqq, "put_queue");
+ cfq_log_cfqq(cfqd, cfqq, "free_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq_is_active_queue(cfqq))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
kmem_cache_free(cfq_pool, cfqq);
}
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+ elv_put_ioq(cfqq->ioq);
+}
+
/*
* Must always be called with the rcu_read_lock() held
*/
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}
cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
- cfqd->queue->node);
+ q->node);
if (cic) {
cic->last_end_request = jiffies;
INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
- int ioprio_class;
+ int ioprio_class, ioprio;
if (!cfq_cfqq_prio_changed(cfqq))
return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
/*
* no prio set, inherit CPU scheduling settings
*/
- cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ ioprio = task_nice_ioprio(tsk);
+ ioprio_class = task_nice_ioclass(tsk);
break;
case IOPRIO_CLASS_RT:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_RT;
break;
case IOPRIO_CLASS_BE:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
- cfqq->ioprio = 7;
- cfq_clear_cfqq_idle_window(cfqq);
+ ioprio_class = IOPRIO_CLASS_IDLE;
+ ioprio = 7;
+ elv_clear_ioq_idle_window(cfqq->ioq);
break;
}
+ elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+ elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
*/
- cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio = ioprio;
+ cfqq->org_ioprio_class = ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
struct cfq_data *cfqd = cic->key;
struct cfq_queue *cfqq;
unsigned long flags;
+ struct request_queue *q = cfqd->queue;
if (unlikely(!cfqd))
return;
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
if (cfqq)
cfq_mark_cfqq_prio_changed(cfqq);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+ struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
-
+ struct request_queue *q = cfqd->queue;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
retry:
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
if (!cfqq) {
if (new_cfqq) {
- cfqq = new_cfqq;
- new_cfqq = NULL;
+ goto alloc_ioq;
} else if (gfp_mask & __GFP_WAIT) {
/*
* Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
if (!cfqq)
goto out;
}
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ cfqq = new_cfqq;
+ new_cfqq = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q,
+ gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ kmem_cache_free(cfq_pool, cfqq);
+ cfqq = NULL;
+ goto out;
+ }
+ }
- RB_CLEAR_NODE(&cfqq->rb_node);
+ /*
+ * Both cfqq and ioq objects allocated. Do the initializations
+ * now.
+ */
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
-
- atomic_set(&cfqq->ref, 0);
cfqq->cfqd = cfqd;
cfq_mark_cfqq_prio_changed(cfqq);
+ cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
+ elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+ cfqq->org_ioprio, is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
- cfq_mark_cfqq_idle_window(cfqq);
- cfq_mark_cfqq_sync(cfqq);
+ elv_mark_ioq_idle_window(cfqq->ioq);
+ elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
if (new_cfqq)
kmem_cache_free(cfq_pool, new_cfqq);
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
out:
WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
return cfqq;
}
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
- switch (ioprio_class) {
- case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
- case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
- case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
- default:
- BUG();
- }
-}
-
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+ gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
- struct cfq_queue **async_cfqq = NULL;
+ struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
- cfqq = *async_cfqq;
+ async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+ ioprio);
+ cfqq = async_cfqq;
}
if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
return NULL;
}
- /*
- * pin the queue now that it's allocated, scheduler exit will prune it
- */
- if (!is_sync && !(*async_cfqq)) {
- atomic_inc(&cfqq->ref);
- *async_cfqq = cfqq;
- }
+ if (!is_sync && !async_cfqq)
+ io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- atomic_inc(&cfqq->ref);
+ /* ioc reference */
+ elv_get_ioq(cfqq->ioq);
return cfqq;
}
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
{
unsigned long flags;
int ret;
+ struct request_queue *q = cfqd->queue;
ret = radix_tree_preload(gfp_mask);
if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
radix_tree_preload_end();
if (!ret) {
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ spin_lock_irqsave(q->queue_lock, flags);
list_add(&cic->queue_list, &cfqd->cic_list);
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
}
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
{
struct io_context *ioc = NULL;
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
might_sleep_if(gfp_mask & __GFP_WAIT);
- ioc = get_io_context(gfp_mask, cfqd->queue->node);
+ ioc = get_io_context(gfp_mask, q->node);
if (!ioc)
return NULL;
@@ -1889,7 +1494,6 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
-
return cic;
err_free:
cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
}
static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
- unsigned long elapsed = jiffies - cic->last_end_request;
- unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
- cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
- cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
- cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
struct request *rq)
{
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
}
/*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
*/
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+ struct request *rq)
{
- int old_idle, enable_idle;
+ struct cfq_io_context *cic = RQ_CIC(rq);
/*
- * Don't idle for async or idle io prio class
+ * Enabling/disabling idling based on thinktime has been moved
+ * to the common layer.
*/
- if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
- return;
-
- enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
- if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (cfqd->hw_tag && CIC_SEEKY(cic)))
- enable_idle = 0;
- else if (sample_valid(cic->ttime_samples)) {
- if (cic->ttime_mean > cfqd->cfq_slice_idle)
- enable_idle = 0;
- else
- enable_idle = 1;
- }
+ if (!atomic_read(&cic->ioc->nr_tasks) ||
+ (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+ return 0;
- if (old_idle != enable_idle) {
- cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
- if (enable_idle)
- cfq_mark_cfqq_idle_window(cfqq);
- else
- cfq_clear_cfqq_idle_window(cfqq);
- }
+ return 1;
}
/*
* Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * the cfq-specific parts are left here.
*/
static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
- struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
{
- struct cfq_queue *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
- cfqq = cfqd->active_queue;
if (!cfqq)
return 0;
- if (cfq_slice_used(cfqq))
- return 1;
-
- if (cfq_class_idle(new_cfqq))
- return 0;
-
- if (cfq_class_idle(cfqq))
- return 1;
-
/*
* if the new request is sync, but the currently running queue is
* not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_meta(rq) && !cfqq->meta_pending)
return 1;
- /*
- * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
- */
- if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
- return 1;
-
- if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+ if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
return 0;
/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
}
/*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
-
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
- cfq_service_tree_add(cfqd, cfqq, 1);
-
- cfqq->slice_end = 0;
- cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
* Called when a new fs request (rq) is added (to cfqq). Check if there's
* something we should do about it
+ * After enqueuing the request, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
*/
static void
cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
{
struct cfq_io_context *cic = RQ_CIC(rq);
- cfqd->rq_queued++;
if (rq_is_meta(rq))
cfqq->meta_pending++;
- cfq_update_io_thinktime(cfqd, cic);
cfq_update_io_seektime(cfqd, cic, rq);
- cfq_update_idle_window(cfqd, cfqq, cic);
cic->last_request_pos = rq->sector + rq->nr_sectors;
-
- if (cfqq == cfqd->active_queue) {
- /*
- * Remember that we saw a request from this process, but
- * don't start queuing just yet. Otherwise we risk seeing lots
- * of tiny requests, because we disrupt the normal plugging
- * and merging. If the request is already larger than a single
- * page, let it rip immediately. For that case we assume that
- * merging is already done. Ditto for a busy system that
- * has other work pending, don't risk delaying until the
- * idle timer unplug to continue working.
- */
- if (cfq_cfqq_wait_request(cfqq)) {
- if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
- cfqd->busy_queues > 1) {
- del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
- }
- } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
- /*
- * not the active queue - expire current slice if it is
- * idle and has expired it's mean thinktime or this new queue
- * has some old slice time left and is of higher priority or
- * this new queue is RT and the current one is BE
- */
- cfq_preempt_queue(cfqd, cfqq);
- blk_start_queueing(cfqd->queue);
- }
}
static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
cfq_rq_enqueued(cfqd, cfqq, rq);
}
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
- if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
- cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
- if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
- cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
- return;
-
- if (cfqd->hw_tag_samples++ < 50)
- return;
-
- if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
- cfqd->hw_tag = 1;
- else
- cfqd->hw_tag = 0;
-
- cfqd->hw_tag_samples = 0;
- cfqd->rq_in_driver_peak = 0;
-}
-
static void cfq_completed_request(struct request_queue *q, struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
now = jiffies;
cfq_log_cfqq(cfqd, cfqq, "complete");
- cfq_update_hw_tag(cfqd);
-
- WARN_ON(!cfqd->rq_in_driver);
- WARN_ON(!cfqq->dispatched);
- cfqd->rq_in_driver--;
- cfqq->dispatched--;
-
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight--;
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
if (sync)
RQ_CIC(rq)->last_end_request = now;
-
- /*
- * If this is the active queue, check if it needs to be expired,
- * or if we want to idle in case it has no pending requests.
- */
- if (cfqd->active_queue == cfqq) {
- const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
- if (cfq_cfqq_slice_new(cfqq)) {
- cfq_set_prio_slice(cfqd, cfqq);
- cfq_clear_cfqq_slice_new(cfqq);
- }
- /*
- * If there are no requests waiting in this queue, and
- * there are other queues ready to issue requests, AND
- * those other queues are issuing requests within our
- * mean seek distance, give them a chance to run instead
- * of idling.
- */
- if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
- else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
- sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
- }
-
- if (!cfqd->rq_in_driver)
- cfq_schedule_dispatch(cfqd);
}
/*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct io_queue *ioq = cfqq->ioq;
+
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
* users of the filesystem
*/
if (cfq_class_idle(cfqq))
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- if (cfqq->ioprio > IOPRIO_NORM)
- cfqq->ioprio = IOPRIO_NORM;
+ elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+ if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+ elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
} else {
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- if (cfqq->ioprio != cfqq->org_ioprio)
- cfqq->ioprio = cfqq->org_ioprio;
+ if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+ elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+ if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+ elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
}
}
static inline int __cfq_may_queue(struct cfq_queue *cfqq)
{
- if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
- !cfq_cfqq_must_alloc_slice(cfqq)) {
+ if ((elv_ioq_wait_request(cfqq->ioq) ||
+ cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
cfq_mark_cfqq_must_alloc_slice(cfqq);
return ELV_MQUEUE_MUST;
}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq->allocated[rw]++;
cfq_clear_cfqq_must_alloc(cfqq);
- atomic_inc(&cfqq->ref);
+ elv_get_ioq(cfqq->ioq);
spin_unlock_irqrestore(q->queue_lock, flags);
rq->elevator_private = cic;
- rq->elevator_private2 = cfqq;
+ rq->ioq = cfqq->ioq;
return 0;
queue_fail:
if (cic)
put_io_context(cic->ioc);
- cfq_schedule_dispatch(cfqd);
+ elv_schedule_dispatch(cfqd->queue);
spin_unlock_irqrestore(q->queue_lock, flags);
cfq_log(cfqd, "set_request fail");
return 1;
}
-static void cfq_kick_queue(struct work_struct *work)
-{
- struct cfq_data *cfqd =
- container_of(work, struct cfq_data, unplug_work);
- struct request_queue *q = cfqd->queue;
-
- spin_lock_irq(q->queue_lock);
- blk_start_queueing(q);
- spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
- struct cfq_data *cfqd = (struct cfq_data *) data;
- struct cfq_queue *cfqq;
- unsigned long flags;
- int timed_out = 1;
-
- cfq_log(cfqd, "idle timer fired");
-
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
- cfqq = cfqd->active_queue;
- if (cfqq) {
- timed_out = 0;
-
- /*
- * We saw a request before the queue expired, let it through
- */
- if (cfq_cfqq_must_dispatch(cfqq))
- goto out_kick;
-
- /*
- * expired
- */
- if (cfq_slice_used(cfqq))
- goto expire;
-
- /*
- * only expire and reinvoke request handler, if there are
- * other queues with pending requests
- */
- if (!cfqd->busy_queues)
- goto out_cont;
-
- /*
- * not expired and it has a request pending, let it dispatch
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto out_kick;
- }
-expire:
- cfq_slice_expired(cfqd, timed_out);
-out_kick:
- cfq_schedule_dispatch(cfqd);
-out_cont:
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
- del_timer_sync(&cfqd->idle_slice_timer);
- cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
- int i;
-
- for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
- }
-
- if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
struct request_queue *q = cfqd->queue;
- cfq_shutdown_timer_wq(cfqd);
-
spin_lock_irq(q->queue_lock);
- if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
while (!list_empty(&cfqd->cic_list)) {
struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}
- cfq_put_async_queues(cfqd);
-
spin_unlock_irq(q->queue_lock);
-
- cfq_shutdown_timer_wq(cfqd);
-
kfree(cfqd);
}
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;
- cfqd->service_tree = CFQ_RB_ROOT;
-
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->queue = q;
- init_timer(&cfqd->idle_slice_timer);
- cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
- cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
- INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
cfqd->last_end_request = jiffies;
cfqd->cfq_quantum = cfq_quantum;
cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
- cfqd->cfq_slice[0] = cfq_slice_async;
- cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
- cfqd->cfq_slice_idle = cfq_slice_idle;
- cfqd->hw_tag = 1;
return cfqd;
}
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
#undef SHOW_FUNCTION
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
#undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(fifo_expire_async),
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
- CFQ_ATTR(slice_sync),
- CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
- CFQ_ATTR(slice_idle),
__ATTR_NULL
};
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
.elevator_dispatch_fn = cfq_dispatch_requests,
.elevator_add_req_fn = cfq_insert_request,
.elevator_activate_req_fn = cfq_activate_request,
- .elevator_deactivate_req_fn = cfq_deactivate_request,
- .elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
+ .elevator_free_sched_queue_fn = cfq_free_cfq_queue,
+ .elevator_active_ioq_set_fn = cfq_active_ioq_set,
+ .elevator_active_ioq_reset_fn = cfq_active_ioq_reset,
+ .elevator_arm_slice_timer_fn = cfq_arm_slice_timer,
+ .elevator_should_preempt_fn = cfq_should_preempt,
+ .elevator_update_idle_window_fn = cfq_update_idle_window,
+ .elevator_close_cooperator_fn = cfq_close_cooperator,
},
+ .elevator_features = ELV_IOSCHED_NEED_FQ,
.elevator_attrs = cfq_attrs,
.elevator_name = "cfq",
.elevator_owner = THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
static int __init cfq_init(void)
{
- /*
- * could be 0 on HZ < 1000 setups
- */
- if (!cfq_slice_async)
- cfq_slice_async = 1;
- if (!cfq_slice_idle)
- cfq_slice_idle = 1;
-
if (cfq_slab_setup())
return -ENOMEM;
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (3 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
` (16 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This patch enables hierarchical fair queuing in the common layer. It is
controlled by the config option CONFIG_GROUP_IOSCHED.
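For illustration, a build configuration turning this on might look roughly
like the fragment below; only CONFIG_GROUP_IOSCHED is introduced by this
patch series, the other options are standard prerequisites and are listed
here as assumptions:

# general cgroup support (assumed prerequisite)
CONFIG_CGROUPS=y
# hierarchical group IO scheduling, added by this patch series
CONFIG_GROUP_IOSCHED=y
# CFQ, which selects ELV_FAIR_QUEUING (see patch 04)
CONFIG_IOSCHED_CFQ=y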
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/blk-ioc.c | 3 +
block/elevator-fq.c | 1037 +++++++++++++++++++++++++++++++++++++----
block/elevator-fq.h | 149 ++++++-
block/elevator.c | 6 +
include/linux/blkdev.h | 7 +-
include/linux/cgroup_subsys.h | 7 +
include/linux/iocontext.h | 5 +
init/Kconfig | 8 +
8 files changed, 1127 insertions(+), 95 deletions(-)
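As a minimal illustration of the proportional-share idea this patch builds on (not part of the patch; the group names and numbers are made up): whichever entity has the smallest virtual finish time is served next, and its finish time advances by slice/weight, so over time siblings split disk time in proportion to their weights. A stand-alone user-space sketch:

	#include <stdio.h>

	struct entity {
		const char *name;
		unsigned long weight;		/* e.g. the cgroup "weight" file */
		unsigned long long vfinish;	/* virtual finish time */
		unsigned long long service;	/* total disk time received (ms) */
	};

	int main(void)
	{
		struct entity e[2] = {
			{ "groupA", 500, 0, 0 },
			{ "groupB", 250, 0, 0 },
		};
		const unsigned long slice = 100;	/* ms of disk time per round */
		int i, round;

		for (round = 0; round < 30; round++) {
			/* serve the entity with the smallest virtual finish time */
			i = (e[0].vfinish <= e[1].vfinish) ? 0 : 1;
			e[i].service += slice;
			e[i].vfinish += (slice * 1000) / e[i].weight;
		}
		for (i = 0; i < 2; i++)
			printf("%s: weight=%lu service=%llums\n",
			       e[i].name, e[i].weight, e[i].service);
		/* Expect roughly a 2:1 split, matching the 500:250 weights. */
		return 0;
	}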
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
spin_lock_init(&ret->lock);
ret->ioprio_changed = 0;
ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ret->cgroup_changed = 0;
+#endif
ret->last_waited = jiffies; /* doesn't matter... */
ret->nr_batch_requests = 0; /* because this is 0 */
ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
#define ELV_SLICE_SCALE (5)
#define ELV_HW_QUEUE_MIN (5)
+
+#define IO_DEFAULT_GRP_WEIGHT 500
+#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
}
/* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+ int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+ int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+ int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+ struct io_group *iog;
+ struct io_entity *entity, *next_active;
+
+ if (sd->active_entity != NULL)
+ /* will update/requeue at the end of service */
+ return 0;
+
+ /*
+ * NOTE: this can be improved in many ways, such as returning
+ * 1 (and thus propagating upwards the update) only when the
+ * budget changes, or caching the bfqq that will be scheduled
+ * next from this subtree. By now we worry more about
+ * correctness than about performance...
+ */
+ next_active = bfq_lookup_next_entity(sd, 0);
+ sd->next_active = next_active;
+
+ if (next_active != NULL) {
+ iog = container_of(sd, struct io_group, sched_data);
+ entity = iog->my_entity;
+ if (entity != NULL)
+ entity->budget = next_active->budget;
+ }
+
+ return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+ BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+ for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+ return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+ struct io_entity *entity)
+{
+}
+#endif
/*
* Shift for timestamp calculations. This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
bfq_update_active_tree(node);
}
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
- WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
- return IOPRIO_BE_NR - ioprio;
-}
-
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
elv_get_ioq(ioq);
}
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
- entity->ioprio = entity->new_ioprio;
- entity->ioprio_class = entity->new_ioprio_class;
- entity->sched_data = &iog->sched_data;
-}
-
/**
* bfq_find_deepest - find the deepest node that an extraction can modify.
* @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
struct io_queue *ioq = io_entity_to_ioq(entity);
if (entity->ioprio_changed) {
+ old_st->wsum -= entity->weight;
entity->ioprio = entity->new_ioprio;
entity->ioprio_class = entity->new_ioprio_class;
+ entity->weight = entity->new_weight;
entity->ioprio_changed = 0;
/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
entity->budget = elv_prio_to_slice(efqd, ioq);
}
- old_st->wsum -= entity->weight;
- entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
/*
* NOTE: here we may be changing the weight too early,
* this will cause unfairness. The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
if (add_front) {
struct io_entity *next_entity;
- /*
- * Determine the entity which will be dispatched next
- * Use sd->next_active once hierarchical patch is applied
- */
- next_entity = bfq_lookup_next_entity(sd, 0);
+ /* Determine the entity which will be dispatched next */
+ next_entity = sd->next_active;
if (next_entity && next_entity != entity) {
struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
}
/**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
* @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
*/
void bfq_activate_entity(struct io_entity *entity, int add_front)
{
- __bfq_activate_entity(entity, add_front);
+ struct io_sched_data *sd;
+
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, add_front);
+
+ add_front = 0;
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ /*
+ * No need to propagate the activation to the
+ * upper entities, as they will be updated when
+ * the active entity is rescheduled.
+ */
+ break;
+ }
}
/**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
else if (entity->tree != NULL)
BUG();
+ if (was_active || sd->next_active == entity)
+ ret = bfq_update_next_active(sd);
+
if (!requeue || !bfq_gt(entity->finish, st->vtime))
bfq_forget_entity(st, entity);
else
bfq_idle_insert(st, entity);
BUG_ON(sd->active_entity == entity);
+ BUG_ON(sd->next_active == entity);
return ret;
}
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
*/
void bfq_deactivate_entity(struct io_entity *entity, int requeue)
{
- __bfq_deactivate_entity(entity, requeue);
+ struct io_sched_data *sd;
+ struct io_entity *parent;
+
+ for_each_entity_safe(entity, parent) {
+ sd = entity->sched_data;
+
+ if (!__bfq_deactivate_entity(entity, requeue))
+ /*
+ * The parent entity is still backlogged, and
+ * we don't need to update it as it is still
+ * under service.
+ */
+ break;
+
+ if (sd->next_active != NULL)
+ /*
+ * The parent entity is still backlogged and
+ * the budgets on the path towards the root
+ * need to be updated.
+ */
+ goto update;
+
+ /*
+ * If we reach there the parent is no more backlogged and
+ * we want to propagate the dequeue upwards.
+ */
+ requeue = 1;
+ }
+
+ return;
+
+update:
+ entity = parent;
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, 0);
+
+ sd = entity->sched_data;
+ if (!bfq_update_next_active(sd))
+ break;
+ }
}
/**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
entity = __bfq_lookup_next_entity(st);
if (entity != NULL) {
if (extract) {
+ bfq_check_next_active(sd, entity);
bfq_active_extract(st, entity);
sd->active_entity = entity;
+ sd->next_active = NULL;
}
break;
}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
{
struct io_service_tree *st;
- st = io_entity_service_tree(entity);
- entity->service += served;
- BUG_ON(st->wsum == 0);
- st->vtime += bfq_delta(served, st->wsum);
- bfq_forget_idle(st);
+ for_each_entity(entity) {
+ st = io_entity_service_tree(entity);
+ entity->service += served;
+ BUG_ON(st->wsum == 0);
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+ }
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+ .weight = IO_DEFAULT_GRP_WEIGHT,
+ .ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+ struct io_cgroup, css);
+}
+
+/*
+ * Search for the bfq_group keyed by bfqd in the hash table (for now only a list)
+ * of bgrp. Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ __key = rcu_dereference(iog->key);
+ if (__key == key)
+ return iog;
+ }
+
+ return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ struct cgroup *cgroup;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ cgroup = task_cgroup(current, io_subsys_id);
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, efqd);
+ return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct io_entity *entity = &iog->entity;
+
+ entity->weight = entity->new_weight = iocg->weight;
+ entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+ entity->ioprio_changed = 1;
+ entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+ struct io_entity *entity;
+
+ BUG_ON(parent == NULL);
+ BUG_ON(iog == NULL);
+
+ entity = &iog->entity;
+ entity->parent = parent->my_entity;
+ entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+ struct io_entity *entity = st->first_idle;
+
+ for (; entity != NULL; entity = st->first_idle)
+ __bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct io_cgroup *iocg; \
+ u64 ret; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ spin_lock_irq(&iocg->lock); \
+ ret = iocg->__VAR; \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return ret; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
+ struct cftype *cftype, \
+ u64 val) \
+{ \
+ struct io_cgroup *iocg; \
+ struct io_group *iog; \
+ struct hlist_node *n; \
+ \
+ if (val < (__MIN) || val > (__MAX)) \
+ return -EINVAL; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ \
+ spin_lock_irq(&iocg->lock); \
+ iocg->__VAR = (unsigned long)val; \
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ iog->entity.new_##__VAR = (unsigned long)val; \
+ smp_wmb(); \
+ iog->entity.ioprio_changed = 1; \
+ } \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return 0; \
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
+ * to the root has already an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *leaf = NULL, *prev = NULL;
+ gfp_t flags = GFP_ATOMIC | __GFP_ZERO;
+
+ for (; cgroup != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL) {
+ /*
+ * All the cgroups in the path from there to the
+ * root must have a bfq_group for bfqd, so we don't
+ * need any more allocations.
+ */
+ break;
+ }
+
+ iog = kzalloc_node(sizeof(*iog), flags, q->node);
+ if (!iog)
+ goto cleanup;
+
+ io_group_init_entity(iocg, iog);
+ iog->my_entity = &iog->entity;
+
+ if (leaf == NULL) {
+ leaf = iog;
+ prev = leaf;
+ } else {
+ io_group_set_parent(prev, iog);
+ /*
+ * Build a list of allocated nodes using the key
+ * field, which is still unused and will be initialized
+ * only after the node is connected.
+ */
+ prev->key = iog;
+ prev = iog;
+ }
+ }
+
+ return leaf;
+
+cleanup:
+ while (leaf != NULL) {
+ prev = leaf;
+ leaf = leaf->key;
+ kfree(prev);
+ }
+
+ return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already as a group associated to @bfqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+ struct cgroup *cgroup,
+ struct io_group *leaf,
+ struct elv_fq_data *efqd)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *next, *prev = NULL;
+ unsigned long flags;
+
+ assert_spin_locked(q->queue_lock);
+
+ for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ next = leaf->key;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ BUG_ON(iog != NULL);
+
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ rcu_assign_pointer(leaf->key, key);
+ hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+ spin_unlock_irqrestore(&iocg->lock, flags);
+
+ prev = leaf;
+ leaf = next;
+ }
+
+ BUG_ON(cgroup == NULL && leaf != NULL);
+
+ if (cgroup != NULL && prev != NULL) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, key);
+ io_group_set_parent(prev, iog);
+ }
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary. When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback. If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+ struct cgroup *cgroup, struct elv_fq_data *efqd,
+ int create)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct io_group *iog = NULL;
+ /* Note: Use efqd as key */
+ void *key = efqd;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL || !create)
+ return iog;
+
+ iog = io_group_chain_alloc(q, key, cgroup);
+ if (iog != NULL)
+ io_group_chain_link(q, key, cgroup, iog, efqd);
+
+ return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ struct cgroup *cgroup;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, io_subsys_id);
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ /*
+ * bio merge functions doing lookup don't want to
+ * map bio to root group by default
+ */
+ iog = NULL;
+ }
+ rcu_read_unlock();
+ return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_cgroup *iocg = &io_root_cgroup;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_group *iog = efqd->root_group;
+
+ BUG_ON(!iog);
+ spin_lock_irq(&iocg->lock);
+ hlist_del_rcu(&iog->group_node);
+ spin_unlock_irq(&iocg->lock);
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ struct io_cgroup *iocg;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ iog->entity.parent = NULL;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ iocg = &io_root_cgroup;
+ spin_lock_irq(&iocg->lock);
+ rcu_assign_pointer(iog->key, key);
+ hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+ spin_unlock_irq(&iocg->lock);
+
+ return iog;
+}
+
+struct cftype bfqio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = io_cgroup_weight_read,
+ .write_u64 = io_cgroup_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = io_cgroup_ioprio_class_read,
+ .write_u64 = io_cgroup_ioprio_class_write,
+ },
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, bfqio_files,
+ ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+
+ if (cgroup->parent != NULL) {
+ iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+ if (iocg == NULL)
+ return ERR_PTR(-ENOMEM);
+ } else
+ iocg = &io_root_cgroup;
+
+ spin_lock_init(&iocg->lock);
+ INIT_HLIST_HEAD(&iocg->group_data);
+ iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+ iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+ return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures. For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct task_struct *tsk)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+ /*
+ * ioc == NULL means that the task is either too young or
+ * exiting: if it still has no ioc the ioc can't be shared,
+ * if the task is exiting the attach will fail anyway, no
+ * matter what we return here.
+ */
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not
+ * a very good scheme as a user might get an unfair share. It needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+ int busy, resume;
+ struct io_entity *entity = &ioq->entity;
+ struct elv_fq_data *efqd = &e->efqd;
+ struct io_service_tree *st = io_entity_service_tree(entity);
+
+ busy = elv_ioq_busy(ioq);
+ resume = !!ioq->nr_queued;
+
+ BUG_ON(resume && !entity->on_st);
+ BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+ /*
+ * We could be moving a queue which is on the idle tree of the previous group.
+ * What to do? This queue does not have any requests anyway, so just forget
+ * the entity and free it up from the idle tree.
+ *
+ * This needs cleanup. Hackish.
+ */
+ if (entity->tree == &st->idle) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+ bfq_put_idle_entity(st, entity);
+ }
+
+ if (busy) {
+ BUG_ON(atomic_read(&ioq->ref) < 2);
+
+ if (!resume)
+ elv_del_ioq_busy(e, ioq, 0);
+ else
+ elv_deactivate_ioq(efqd, ioq, 0);
+ }
+
+ /*
+ * Here we use a reference to bfqg. We don't need a refcounter
+ * as the cgroup reference will not be dropped, so that its
+ * destroy() callback will not be invoked.
+ */
+ entity->parent = iog->my_entity;
+ entity->sched_data = &iog->sched_data;
+
+ if (busy && resume)
+ elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+ struct elevator_queue *eq;
+ struct io_entity *entity = iog->my_entity;
+ struct io_service_tree *st;
+ int i;
+
+ eq = container_of(efqd, struct elevator_queue, efqd);
+ hlist_del(&iog->elv_data_node);
+ __bfq_deactivate_entity(entity, 0);
+ io_put_io_group_queues(eq, iog);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ /*
+ * The idle tree may still contain bfq_queues belonging
+ * to exited tasks because they never migrated to a different
+ * cgroup from the one being destroyed now. No one else
+ * can access them so it's safe to act without any lock.
+ */
+ io_flush_idle_tree(st);
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct elv_fq_data *efqd = NULL;
+ unsigned long uninitialized_var(flags);
+
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
+
+ /*
+ * io groups are linked in two lists. One list is maintained
+ * in elevator (efqd->group_list) and other is maintained
+ * per cgroup structure (iocg->group_data).
+ *
+ * While a cgroup is being deleted, the elevator might also be
+ * exiting, and both might try to clean up the same io group,
+ * so we need to be a little careful.
+ *
+ * The following code first accesses efqd under RCU to make sure
+ * iog->key is pointing to a valid efqd and then takes the
+ * associated queue lock. After getting the queue lock it
+ * checks again whether the elevator exit path has already got
+ * hold of the io group (iog->key == NULL). If so, it does not
+ * try to free up the async queues again or flush the idle tree.
+ */
+
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ spin_lock_irqsave(efqd->queue->queue_lock, flags);
+ if (iog->key == efqd)
+ __io_destroy_group(efqd, iog);
+ spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+ }
+ rcu_read_unlock();
+
+ /*
+ * No need to defer the kfree() to the end of the RCU grace
+ * period: we are called from the destroy() callback of our
+ * cgroup, so we can be sure that no one is a) still using
+ * this cgroup or b) doing lookups in it.
+ */
+ kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+
+ /*
+ * Since we are destroying the cgroup, there are no more tasks
+ * referencing it, and all the RCU grace periods that may have
+ * referenced it are ended (as the destruction of the parent
+ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+ * anything else and we don't need any synchronization.
+ */
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+ io_destroy_group(iocg, iog);
+
+ BUG_ON(!hlist_empty(&iocg->group_data));
+
+ kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+ struct hlist_node *pos, *n;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = &e->efqd;
+
+ hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+ elv_data_node) {
+ hlist_del(&iog->elv_data_node);
+
+ __bfq_deactivate_entity(iog->my_entity, 0);
+
+ /*
+ * Don't remove from the group hash, just set an
+ * invalid key. No lookups can race with the
+ * assignment as bfqd is being destroyed; this
+ * implies also that new elements cannot be added
+ * to the list.
+ */
+ rcu_assign_pointer(iog->key, NULL);
+ io_put_io_group_queues(e, iog);
+ }
+}
+
+struct cgroup_subsys io_subsys = {
+ .name = "io",
+ .create = iocg_create,
+ .can_attach = iocg_can_attach,
+ .attach = iocg_attach,
+ .destroy = iocg_destroy,
+ .populate = iocg_populate,
+ .subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio-submitting task and rq don't belong to the same io_group, they
+ * can't be merged.
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ struct request_queue *q = rq->q;
+ struct io_queue *ioq = rq->ioq;
+ struct io_group *iog, *__iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return 1;
+
+ /* Determine the io group of the bio submitting task */
+ iog = io_get_io_group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a different cgroup for which the io
+ * group has not been set up yet. */
+ return 0;
+ }
+
+ /* Determine the io group of the ioq that rq belongs to */
+ __iog = ioq_to_io_group(ioq);
+
+ return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+ struct io_group *iog;
+ unsigned long flags;
+
+ /* Make sure the io group hierarchy has been set up and also set the
+ * io group to which rq belongs. Later we should make use of
+ * bio cgroup patches to determine the io group */
+ spin_lock_irqsave(q->queue_lock, flags);
+ iog = io_get_io_group(q, 1);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ BUG_ON(!iog);
+
+ /* Store iog in rq. TODO: take care of referencing */
+ rq->iog = iog;
}
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+ entity->ioprio = entity->new_ioprio;
+ entity->weight = entity->new_weight;
+ entity->ioprio_class = entity->new_ioprio_class;
+ entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = &q->elevator->efqd;
+
+ return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd.root_group;
+ io_put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+ return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
/* Elevator fair queuing function */
struct io_queue *rq_ioq(struct request *rq)
{
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
{
+ struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
+ io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
return NULL;
sd = &efqd->root_group->sched_data;
- if (extract)
- entity = bfq_lookup_next_entity(sd, 1);
- else
- entity = bfq_lookup_next_entity(sd, 0);
+ for (; sd != NULL; sd = entity->my_sched_data) {
+ if (extract)
+ entity = bfq_lookup_next_entity(sd, 1);
+ else
+ entity = bfq_lookup_next_entity(sd, 0);
+
+ /*
+ * entity can be NULL even though there are busy queues, if all
+ * the busy queues are under a group which is currently under
+ * service.
+ * So if we are just looking for the next ioq while something is
+ * being served, a NULL entity is not an error.
+ */
+ BUG_ON(!entity && extract);
+
+ if (extract)
+ entity->service = 0;
- BUG_ON(!entity);
- if (extract)
- entity->service = 0;
+ if (!entity)
+ return NULL;
+ }
ioq = io_entity_to_ioq(entity);
return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
struct request_queue *q = efqd->queue;
if (ioq) {
- elv_log_ioq(efqd, ioq, "set_active, busy=%d",
- efqd->busy_queues);
+ struct io_group *iog = ioq_to_io_group(ioq);
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+ " weight=%ld group_weight=%ld",
+ efqd->busy_queues,
+ ioq->entity.ioprio, ioq->entity.weight,
+ iog_weight(iog));
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
{
struct io_queue *ioq;
struct elevator_queue *eq = q->elevator;
+ struct io_group *iog = NULL, *new_iog = NULL;
ioq = elv_active_ioq(eq);
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
/*
* Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ *
+ * TODO: In a hierarchical setup, one needs to traverse up the hierarchy
+ * till both the queues are children of the same parent to decide
+ * whether to do the preemption or not. Something like what cfs does
+ * for the cpu scheduler. Will do it a little later.
*/
if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
return 1;
+ iog = ioq_to_io_group(ioq);
+ new_iog = ioq_to_io_group(new_ioq);
+
/*
- * Check with io scheduler if it has additional criterion based on
- * which it wants to preempt existing queue.
+ * If both the queues belong to same group, check with io scheduler
+ * if it has additional criterion based on which it wants to
+ * preempt existing queue.
*/
+ if (iog != new_iog)
+ return 0;
+
if (eq->ops->elevator_should_preempt_fn)
return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_schedule_dispatch(q);
}
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
- struct elv_fq_data *efqd = &q->elevator->efqd;
-
- return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
int ioprio)
{
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
}
EXPORT_SYMBOL(io_group_set_async_queue);
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
- int i, j;
-
- for (i = 0; i < 2; i++)
- for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
-
- /* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
- struct elevator_queue *e, void *key)
-{
- struct io_group *iog;
- int i;
-
- iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
- if (iog == NULL)
- return NULL;
-
- for (i = 0; i < IO_IOPRIO_CLASSES; i++)
- iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
- return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
- struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
- kfree(iog);
-}
-
static void elv_slab_kill(void)
{
/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
INIT_LIST_HEAD(&efqd->idle_list);
+ INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
spin_lock_irq(q->queue_lock);
/* This should drop all the idle tree references of ioq */
elv_free_idle_ioq_list(e);
+ /* This should drop all the io group references of async queues */
+ io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
elv_shutdown_timer_wq(e);
+ /* Wait for iog->key accessors to exit their grace periods. */
+ synchronize_rcu();
BUG_ON(timer_pending(&efqd->idle_slice_timer));
io_free_root_group(e);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
*/
#include <linux/blkdev.h>
+#include <linux/cgroup.h>
#ifndef _BFQ_SCHED_H
#define _BFQ_SCHED_H
#define IO_IOPRIO_CLASSES 3
+#define WEIGHT_MAX 1000
typedef u64 bfq_timestamp_t;
typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
*/
struct io_sched_data {
struct io_entity *active_entity;
+ struct io_entity *next_active;
struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
};
@@ -84,13 +87,12 @@ struct io_sched_data {
* this entity; used for O(log N) lookups into active trees.
* @service: service received during the last round of service.
* @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
* @parent: parent entity, for hierarchical scheduling.
* @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
* associated scheduler queue, %NULL on leaf nodes.
* @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
* @ioprio_class: the ioprio_class in use.
* @new_ioprio_class: when an ioprio_class change is requested, the new
* ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
bfq_timestamp_t min_start;
bfq_service_t service, budget;
- bfq_weight_t weight;
struct io_entity *parent;
struct io_sched_data *my_sched_data;
struct io_sched_data *sched_data;
+ bfq_weight_t weight, new_weight;
unsigned short ioprio, new_ioprio;
unsigned short ioprio_class, new_ioprio_class;
@@ -180,6 +182,75 @@ struct io_queue {
void *sched_queue;
};
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ * both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ * list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ * of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ * the group, one queue per ioprio value per ioprio_class,
+ * except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ * to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ * via RCU from its readers.
+ * o @bfqd is protected by the queue lock, RCU is used to access it
+ * from the readers.
+ * o All the other fields are protected by the @bfqd queue lock.
+ */
+struct io_group {
+ struct io_entity entity;
+ struct hlist_node elv_data_node;
+ struct hlist_node group_node;
+ struct io_sched_data sched_data;
+
+ struct io_entity *my_entity;
+
+ /*
+ * A cgroup has multiple io_groups, one for each request queue.
+ * To find the io group belonging to a particular queue, the
+ * elv_fq_data pointer is stored as a key.
+ */
+ void *key;
+
+ /* async_queue and idle_queue are used only for cfq */
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+ struct cgroup_subsys_state css;
+
+ unsigned long weight, ioprio_class;
+
+ spinlock_t lock;
+ struct hlist_head group_data;
+};
+#else
struct io_group {
struct io_sched_data sched_data;
@@ -187,10 +258,14 @@ struct io_group {
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
};
+#endif
struct elv_fq_data {
struct io_group *root_group;
+ /* List of io groups hanging on this elevator */
+ struct hlist_head group_list;
+
/* List of io queues on idle tree. */
struct list_head idle_list;
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
ioq->entity.ioprio_changed = 1;
}
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
{
ioq->entity.new_ioprio = ioprio;
+ ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
ioq->entity.ioprio_changed = 1;
}
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
sched_data);
}
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of a flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it is made sure that both
+ * the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+ struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence in case of a non-hierarchical setup, nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+ /* Just root group is present and weight is immaterial. */
+ return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
/* Functions used by blksysfs.c */
extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
{
return NULL;
}
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+ return 1;
+}
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
if (bio_integrity(bio) != blk_integrity_rq(rq))
return 0;
+ /* If rq and bio belong to different groups, don't allow merging */
+ if (!io_group_allow_merge(rq, bio))
+ return 0;
+
if (!elv_iosched_allow_merge(rq, bio))
return 0;
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_set_request_io_group(q, rq);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
#ifdef CONFIG_ELV_FAIR_QUEUING
/* io queue request belongs to */
struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* io group request belongs to */
+ struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
};
static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
#endif
/* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;
+#ifdef CONFIG_GROUP_IOSCHED
+ /* If task changes the cgroup, elevator processes it asynchronously */
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.
+config GROUP_IOSCHED
+ bool "Group IO Scheduler"
+ depends on CGROUPS && ELV_FAIR_QUEUING
+ default n
+ ---help---
+ This feature lets IO scheduler recognize task groups and control
+ disk bandwidth allocation to such task groups.
+
endif # CGROUPS
config MM_OWNER
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
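A quick illustration of the ioprio-to-weight mapping that the bfq_ioprio_to_weight() helper in this patch introduces (a stand-alone sketch, assuming IOPRIO_BE_NR is 8 as in mainline and WEIGHT_MAX is 1000 as defined in elevator-fq.h):

	#include <stdio.h>

	#define IOPRIO_BE_NR	8	/* best-effort priority levels, as in mainline */
	#define WEIGHT_MAX	1000	/* from elevator-fq.h in the patch above */

	/* Same formula as bfq_ioprio_to_weight() in the patch. */
	static unsigned long ioprio_to_weight(int ioprio)
	{
		return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX) / IOPRIO_BE_NR;
	}

	int main(void)
	{
		int p;

		for (p = 0; p < IOPRIO_BE_NR; p++)
			printf("ioprio %d -> weight %lu\n", p, ioprio_to_weight(p));
		/* ioprio 0 -> 1000, ioprio 4 -> 500, ioprio 7 -> 125 */
		return 0;
	}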
* [PATCH 06/18] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (4 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups Vivek Goyal
` (15 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Make cfq hierarchical.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 8 ++++++++
block/cfq-iosched.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
init/Kconfig | 2 +-
3 files changed, 57 insertions(+), 1 deletions(-)
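A rough user-space sketch of how a task would be moved into an io cgroup so that this patch migrates its cfq queues (not part of the patch; the "/cgroup/io" mount point and the group name are assumptions and depend on how the hierarchy is mounted):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical mount point and group name */
		FILE *f = fopen("/cgroup/io/groupA/tasks", "w");

		if (!f) {
			perror("open tasks file");
			return 1;
		}
		fprintf(f, "%d\n", (int)getpid());
		fclose(f);

		/*
		 * The move takes effect asynchronously: the elevator notices
		 * ioc->cgroup_changed on the task's next IO and migrates its
		 * sync queue (and drops async queue references) as done in
		 * changed_cgroup() below.
		 */
		return 0;
	}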
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
working environment, suitable for desktop systems.
This is the default I/O scheduler.
+config IOSCHED_CFQ_HIER
+ bool "CFQ Hierarchical Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in cfq.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f90c534..1e9dd5b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1229,6 +1229,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
ioc->ioprio_changed = 0;
}
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ struct io_group *iog, *__iog;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ iog = io_lookup_io_group_current(q);
+
+ if (async_cfqq != NULL) {
+ __iog = cfqq_to_io_group(async_cfqq);
+
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue(async_cfqq);
+ }
+ }
+
+ if (sync_cfqq != NULL) {
+ __iog = cfqq_to_io_group(sync_cfqq);
+ if (iog != __iog)
+ io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_IOSCHED_CFQ_HIER */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -1494,6 +1538,10 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index ab76477..1a4686d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -607,7 +607,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
size is 4096bytes, 512k per 1Gbytes of swap.
config GROUP_IOSCHED
- bool "Group IO Scheduler"
+ bool
depends on CGROUPS && ELV_FAIR_QUEUING
default n
---help---
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (5 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it Vivek Goyal
` (14 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o This patch exports some statistics through the cgroup interface. The two
statistics currently exported are the actual disk time assigned to the cgroup
and the actual number of sectors dispatched to disk on behalf of this cgroup.
o Currently these numbers are aggregate, i.e. they cover all the tasks in
that cgroup on all the disks. Later it may be useful to provide per-disk
statistics as well.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/elevator-fq.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 7 ++++
2 files changed, 106 insertions(+), 2 deletions(-)
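A small user-space sketch of reading the two new statistics (not part of the patch; the mount point and group name are assumptions, while the file names follow the cftype entries added below and are typically exposed by the cgroup core with the "io." subsystem prefix unless the hierarchy is mounted with noprefix):

	#include <stdio.h>

	static void dump(const char *path)
	{
		char buf[64];
		FILE *f = fopen(path, "r");

		if (!f) {
			perror(path);
			return;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", path, buf);
		fclose(f);
	}

	int main(void)
	{
		/* hypothetical mount point and group name */
		dump("/cgroup/io/groupA/io.disk_time");	/* disk time, in ms */
		dump("/cgroup/io/groupA/io.disk_sectors");	/* sectors dispatched */
		return 0;
	}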
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index cdaa46f..b8dbc8b 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -886,13 +886,16 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
return entity;
}
-void entity_served(struct io_entity *entity, bfq_service_t served)
+void entity_served(struct io_entity *entity, bfq_service_t served,
+ bfq_service_t nr_sectors)
{
struct io_service_tree *st;
for_each_entity(entity) {
st = io_entity_service_tree(entity);
entity->service += served;
+ entity->total_service += served;
+ entity->total_sector_service += nr_sectors;
BUG_ON(st->wsum == 0);
st->vtime += bfq_delta(served, st->wsum);
bfq_forget_idle(st);
@@ -1064,6 +1067,92 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
+/*
+ * Traverse all the io_groups associated with this cgroup and calculate the
+ * aggregate disk time received by all the groups on their respective disks.
+ */
+static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_time = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and are
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_time += iog->entity.total_service;
+ }
+ rcu_read_unlock();
+
+ return disk_time;
+}
+
+static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
+/*
+ * Traverse all the io_groups associated with this cgroup and calculate the
+ * aggregate number of sectors transferred by all the groups on their disks.
+ */
+static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ u64 disk_sectors = 0;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and are
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (rcu_dereference(iog->key))
+ disk_sectors += iog->entity.total_sector_service;
+ }
+ rcu_read_unlock();
+
+ return disk_sectors;
+}
+
+static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io_cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ ret = calculate_aggr_disk_sectors(iocg);
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+ return ret;
+}
+
/**
* bfq_group_chain_alloc - allocate a chain of groups.
* @bfqd: queue descriptor.
@@ -1297,6 +1386,14 @@ struct cftype bfqio_files[] = {
.read_u64 = io_cgroup_ioprio_class_read,
.write_u64 = io_cgroup_ioprio_class_write,
},
+ {
+ .name = "disk_time",
+ .read_u64 = io_cgroup_disk_time_read,
+ },
+ {
+ .name = "disk_sectors",
+ .read_u64 = io_cgroup_disk_sectors_read,
+ },
};
int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1712,7 +1809,7 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
- entity_served(&ioq->entity, served);
+ entity_served(&ioq->entity, served, ioq->nr_sectors);
}
/* Tells whether ioq is queued in root group or not */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8c60cf7..f4c6361 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,6 +145,13 @@ struct io_entity {
unsigned short ioprio_class, new_ioprio_class;
int ioprio_changed;
+
+ /*
+ * Keep track of total service received by this entity. Keep the
+ * stats both for time slices and number of sectors dispatched
+ */
+ unsigned long total_service;
+ unsigned long total_sector_service;
};
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (6 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dipatched through cgroups Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
` (13 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o When a sync queue expires, in many cases it might be empty and will then
be deleted from the active tree. This leads to a scenario where, out of two
competing queues, only one is on the tree; when a new queue is selected, a
vtime jump takes place and we don't see service provided in proportion to
the weights.
o In general this is a fundamental problem with fairness for sync queues
that are not continuously backlogged. Idling looks like the only solution to
make sure such queues can get a decent amount of disk bandwidth in the face
of competition from continuously backlogged queues. But excessive idling has
the potential to reduce performance on SSDs and disks with command queuing.
o This patch experiments with waiting for the next request to arrive before a
queue is expired after it has consumed its time slice. This can ensure
more accurate fairness numbers in some cases.
o Introduced a tunable "fairness". If set, the io-controller will put more
focus on getting fairness right than on getting throughput right.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/blk-sysfs.c | 7 +++
block/elevator-fq.c | 117 +++++++++++++++++++++++++++++++++++++++++++++-----
block/elevator-fq.h | 12 +++++
3 files changed, 124 insertions(+), 12 deletions(-)
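A user-space sketch of enabling the new "fairness" tunable on one disk (not part of the patch; "sda" is just an example device, and the attribute sits with the other queue tunables added in blk-sysfs.c below):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/block/sda/queue/fairness", "w");

		if (!f) {
			perror("fairness");
			return 1;
		}
		fputs("1\n", f);	/* favor fairness over throughput */
		fclose(f);
		return 0;
	}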
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 082a273..c942ddc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -294,6 +294,12 @@ static struct queue_sysfs_entry queue_slice_async_entry = {
.show = elv_slice_async_show,
.store = elv_slice_async_store,
};
+
+static struct queue_sysfs_entry queue_fairness_entry = {
+ .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
+ .show = elv_fairness_show,
+ .store = elv_fairness_store,
+};
#endif
static struct attribute *default_attrs[] = {
@@ -311,6 +317,7 @@ static struct attribute *default_attrs[] = {
&queue_slice_idle_entry.attr,
&queue_slice_sync_entry.attr,
&queue_slice_async_entry.attr,
+ &queue_fairness_entry.attr,
#endif
NULL,
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b8dbc8b..ec01273 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1821,6 +1821,44 @@ static inline int is_root_group_ioq(struct request_queue *q,
return (ioq->entity.sched_data == &efqd->root_group->sched_data);
}
+/* Functions to show and store fairness value through sysfs */
+ssize_t elv_fairness_show(struct request_queue *q, char *name)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ data = efqd->fairness;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count)
+{
+ struct elv_fq_data *efqd;
+ unsigned int data;
+ unsigned long flags;
+
+ char *p = (char *)name;
+
+ data = simple_strtoul(p, &p, 10);
+
+ if (data < 0)
+ data = 0;
+ else if (data > INT_MAX)
+ data = INT_MAX;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ efqd = &q->elevator->efqd;
+ efqd->fairness = data;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return count;
+}
+
/* Functions to show and store elv_idle_slice value through sysfs */
ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
{
@@ -2061,7 +2099,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
* io scheduler if it wants to disable idling based on additional
* considerations like seek pattern.
*/
- if (enable_idle) {
+ if (enable_idle && !efqd->fairness) {
if (eq->ops->elevator_update_idle_window_fn)
enable_idle = eq->ops->elevator_update_idle_window_fn(
eq, ioq->sched_queue, rq);
@@ -2395,10 +2433,11 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
assert_spin_locked(q->queue_lock);
elv_log_ioq(efqd, ioq, "slice expired");
- if (elv_ioq_wait_request(ioq))
+ if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
del_timer(&efqd->idle_slice_timer);
elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_wait_busy(ioq);
/*
* if ioq->slice_end = 0, that means a queue was expired before first
@@ -2563,7 +2602,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
* has other work pending, don't risk delaying until the
* idle timer unplug to continue working.
*/
- if (elv_ioq_wait_request(ioq)) {
+ if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
efqd->busy_queues > 1) {
del_timer(&efqd->idle_slice_timer);
@@ -2571,6 +2610,17 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
}
elv_mark_ioq_must_dispatch(ioq);
}
+
+ /*
+ * If we were waiting for a request on this queue, wait is
+ * done. Schedule the next dispatch
+ */
+ if (elv_ioq_wait_busy(ioq)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_ioq_wait_busy(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_schedule_dispatch(q);
+ }
} else if (elv_should_preempt(q, ioq, rq)) {
/*
* not the active queue - expire current slice if it is
@@ -2598,6 +2648,9 @@ void elv_idle_slice_timer(unsigned long data)
if (ioq) {
+ if (elv_ioq_wait_busy(ioq))
+ goto expire;
+
/*
* We saw a request before the queue expired, let it through
*/
@@ -2631,7 +2684,7 @@ out_cont:
spin_unlock_irqrestore(q->queue_lock, flags);
}
-void elv_ioq_arm_slice_timer(struct request_queue *q)
+void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2644,26 +2697,38 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk_queue_nonrot(q) && efqd->hw_tag)
+ if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
return;
/*
- * still requests with the driver, don't idle
+ * idle is disabled, either manually or by past process history
*/
- if (efqd->rq_in_driver)
+ if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
return;
/*
- * idle is disabled, either manually or by past process history
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select the next queue for dispatch.
*/
- if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+ if (efqd->fairness && wait_for_busy && !ioq->dispatched) {
+ elv_mark_ioq_wait_busy(ioq);
+ sl = efqd->elv_slice_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log(efqd, "arm idle: %lu wait busy=1", sl);
+ return;
+ }
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq_in_driver && !efqd->fairness)
return;
/*
* may be iosched got its own idling logic. In that case io
* scheduler will take care of arming the timer, if need be.
*/
- if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+ if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
q->elevator->ops->elevator_arm_slice_timer_fn(q,
ioq->sched_queue);
} else {
@@ -2706,6 +2771,12 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto expire;
}
+ /* We are waiting for this queue to become busy before it expires.*/
+ if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -2915,6 +2986,25 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_ioq_set_prio_slice(q, ioq);
elv_clear_ioq_slice_new(ioq);
}
+
+ if (elv_ioq_class_idle(ioq)) {
+ elv_ioq_slice_expired(q);
+ goto done;
+ }
+
+ if (efqd->fairness && sync && !ioq->nr_queued) {
+ /*
+ * If fairness is enabled, wait for one extra idle
+ * period in the hope that this queue will get
+ * backlogged again
+ */
+ if (elv_ioq_slice_used(ioq))
+ elv_ioq_arm_slice_timer(q, 1);
+ else
+ elv_ioq_arm_slice_timer(q, 0);
+ goto done;
+ }
+
/*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
@@ -2922,13 +3012,14 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ if (elv_ioq_slice_used(ioq))
elv_ioq_slice_expired(q);
else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
- elv_ioq_arm_slice_timer(q);
+ elv_ioq_arm_slice_timer(q, 0);
}
+done:
if (!efqd->rq_in_driver)
elv_schedule_dispatch(q);
}
@@ -3035,6 +3126,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
efqd->elv_slice_idle = elv_slice_idle;
efqd->hw_tag = 1;
+ /* For the time being keep fairness enabled by default */
+ efqd->fairness = 1;
return 0;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f4c6361..7d3434b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -316,6 +316,13 @@ struct elv_fq_data {
unsigned long long rate_sampling_start; /*sampling window start jifies*/
/* number of sectors finished io during current sampling window */
unsigned long rate_sectors_current;
+
+ /*
+ * If set to 1, this disables many of the optimizations done to boost
+ * throughput and focuses more on providing fairness for sync
+ * queues.
+ */
+ int fairness;
};
extern int elv_slice_idle;
@@ -340,6 +347,7 @@ enum elv_queue_state_flags {
ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
+ ELV_QUEUE_FLAG_wait_busy, /* wait for this queue to get busy */
ELV_QUEUE_FLAG_NR,
};
@@ -363,6 +371,7 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
static inline struct io_service_tree *
io_entity_service_tree(struct io_entity *entity)
@@ -541,6 +550,9 @@ extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
size_t count);
+extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+ size_t count);
/* Functions used by elevator.c */
extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 09/18] io-controller: Separate out queue and data
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (7 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (12 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o So far noop, deadline and AS had one common structure called *_data which
contained both the queue information where requests are queued and the
common data used for scheduling. This patch breaks this common structure
into two parts, *_queue and *_data. This is along the lines of cfq, where
all the requests are queued in the queue and the common data and tunables
are part of the data (see the sketch below).
o It does not change the functionality, but this re-organization helps once
noop, deadline and AS are changed to use hierarchical fair queuing.
o Looks like the queue_empty function is not required; we can check
q->nr_sorted in the elevator layer to see if the io scheduler queues are
empty or not.
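The shape of the split is roughly the following. This is a schematic only,
with placeholder struct and member names and trimmed field lists; the real
structures introduced below use the kernel's rb_root/list_head lists.

/* One instance per scheduling queue: this is where requests live. */
struct example_queue {
	void *sort_tree[2];	/* stands in for the per-direction rb trees */
	void *fifo[2];		/* stands in for the per-direction FIFO lists */
};

/* One instance per device: tunables and global scheduling state stay here. */
struct example_data {
	void *q;		/* back-pointer to the owning request queue */
	int fifo_expire[2];
	int fifo_batch;
};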
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/as-iosched.c | 208 ++++++++++++++++++++++++++--------------------
block/deadline-iosched.c | 117 ++++++++++++++++----------
block/elevator.c | 111 +++++++++++++++++++++----
block/noop-iosched.c | 59 ++++++-------
include/linux/elevator.h | 8 ++-
5 files changed, 319 insertions(+), 184 deletions(-)
diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
* or timed out */
};
-struct as_data {
- /*
- * run time data
- */
-
- struct request_queue *q; /* the "owner" queue */
-
+struct as_queue {
/*
* requests (as_rq s) are present on both sort_list and fifo_list
*/
@@ -90,6 +84,14 @@ struct as_data {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+ unsigned long last_check_fifo[2];
+ int write_batch_count; /* max # of reqs in a write batch */
+ int current_write_count; /* how many requests left this batch */
+ int write_batch_idled; /* has the write batch gone idle? */
+};
+
+struct as_data {
+ struct request_queue *q; /* the "owner" queue */
sector_t last_sector[2]; /* last SYNC & ASYNC sectors */
unsigned long exit_prob; /* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
sector_t new_seek_mean;
unsigned long current_batch_expires;
- unsigned long last_check_fifo[2];
int changed_batch; /* 1: waiting for old batch to end */
int new_batch; /* 1: waiting on first read complete */
- int batch_data_dir; /* current batch SYNC / ASYNC */
- int write_batch_count; /* max # of reqs in a write batch */
- int current_write_count; /* how many requests left this batch */
- int write_batch_idled; /* has the write batch gone idle? */
enum anticipation_status antic_status;
unsigned long antic_start; /* jiffies: when it started */
struct timer_list antic_timer; /* anticipatory scheduling timer */
- struct work_struct antic_work; /* Deferred unplugging */
+ struct work_struct antic_work; /* Deferred unplugging */
struct io_context *io_context; /* Identify the expected process */
int ioc_finished; /* IO associated with io_context is finished */
int nr_dispatched;
+ int batch_data_dir; /* current batch SYNC / ASYNC */
/*
* settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
/*
* rb tree support functions
*/
-#define RQ_RB_ROOT(ad, rq) (&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq) (&(asq)->sort_list[rq_is_sync((rq))])
static void as_add_rq_rb(struct as_data *ad, struct request *rq)
{
struct request *alias;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
- while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+ while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
as_move_to_dispatch(ad, alias);
as_antic_stop(ad);
}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
{
- elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+ elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
}
/*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
* what request to process next. Anticipation works on top of this.
*/
static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
{
struct rb_node *rbnext = rb_next(&last->rb_node);
struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
else {
const int data_dir = rq_is_sync(last);
- rbnext = rb_first(&ad->sort_list[data_dir]);
+ rbnext = rb_first(&asq->sort_list[data_dir]);
if (rbnext && rbnext != &last->rb_node)
next = rb_entry_rq(rbnext);
}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
static void as_update_rq(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
/* keep the next_rq cache up to date */
- ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+ asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
/*
* have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
{
unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
long write_time;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
write_time = (jiffies - ad->current_batch_expires) + batch;
if (write_time < 0)
write_time = 0;
- if (write_time > batch && !ad->write_batch_idled) {
+ if (write_time > batch && !asq->write_batch_idled) {
if (write_time > batch * 3)
- ad->write_batch_count /= 2;
+ asq->write_batch_count /= 2;
else
- ad->write_batch_count--;
- } else if (write_time < batch && ad->current_write_count == 0) {
+ asq->write_batch_count--;
+ } else if (write_time < batch && asq->current_write_count == 0) {
if (batch > write_time * 3)
- ad->write_batch_count *= 2;
+ asq->write_batch_count *= 2;
else
- ad->write_batch_count++;
+ asq->write_batch_count++;
}
- if (ad->write_batch_count < 1)
- ad->write_batch_count = 1;
+ if (asq->write_batch_count < 1)
+ asq->write_batch_count = 1;
}
/*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
const int data_dir = rq_is_sync(rq);
struct as_data *ad = q->elevator->elevator_data;
struct io_context *ioc;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
* Update the "next_rq" cache if we are about to remove its
* entry
*/
- if (ad->next_rq[data_dir] == rq)
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ if (asq->next_rq[data_dir] == rq)
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
rq_fifo_clear(rq);
as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
*
* See as_antic_expired comment.
*/
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
{
struct request *rq;
long delta_jif;
- delta_jif = jiffies - ad->last_check_fifo[adir];
+ delta_jif = jiffies - asq->last_check_fifo[adir];
if (unlikely(delta_jif < 0))
delta_jif = -delta_jif;
if (delta_jif < ad->fifo_expire[adir])
return 0;
- ad->last_check_fifo[adir] = jiffies;
+ asq->last_check_fifo[adir] = jiffies;
- if (list_empty(&ad->fifo_list[adir]))
+ if (list_empty(&asq->fifo_list[adir]))
return 0;
- rq = rq_entry_fifo(ad->fifo_list[adir].next);
+ rq = rq_entry_fifo(asq->fifo_list[adir].next);
return time_after(jiffies, rq_fifo_time(rq));
}
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
* as_batch_expired returns true if the current batch has expired. A batch
* is a set of reads or a set of writes.
*/
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
{
if (ad->changed_batch || ad->new_batch)
return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
return time_after(jiffies, ad->current_batch_expires);
return time_after(jiffies, ad->current_batch_expires)
- || ad->current_write_count == 0;
+ || asq->current_write_count == 0;
}
/*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
ad->io_context = NULL;
}
- if (ad->current_write_count != 0)
- ad->current_write_count--;
+ if (asq->current_write_count != 0)
+ asq->current_write_count--;
}
ad->ioc_finished = 0;
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
/*
* take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
static int as_dispatch_request(struct request_queue *q, int force)
{
struct as_data *ad = q->elevator->elevator_data;
- const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
- const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
struct request *rq;
+ struct as_queue *asq = elv_select_sched_queue(q, force);
+ int reads, writes;
+
+ if (!asq)
+ return 0;
+
+ reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+ writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
if (unlikely(force)) {
/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 0;
ad->new_batch = 0;
- while (ad->next_rq[BLK_RW_SYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+ while (asq->next_rq[BLK_RW_SYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
- while (ad->next_rq[BLK_RW_ASYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+ while (asq->next_rq[BLK_RW_ASYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
return dispatched;
}
/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
- if (ad->current_write_count == 0 || !writes)
- ad->write_batch_idled = 1;
+ if (asq->current_write_count == 0 || !writes)
+ asq->write_batch_idled = 1;
}
if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
|| ad->changed_batch)
return 0;
- if (!(reads && writes && as_batch_expired(ad))) {
+ if (!(reads && writes && as_batch_expired(ad, asq))) {
/*
* batch is still running or no reads or no writes
*/
- rq = ad->next_rq[ad->batch_data_dir];
+ rq = asq->next_rq[ad->batch_data_dir];
if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
- if (as_fifo_expired(ad, BLK_RW_SYNC))
+ if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
goto fifo_expired;
if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
if (writes && ad->batch_data_dir == BLK_RW_SYNC)
/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 1;
}
ad->batch_data_dir = BLK_RW_SYNC;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
- ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+ asq->last_check_fifo[ad->batch_data_dir] = jiffies;
goto dispatch_request;
}
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
if (ad->batch_data_dir == BLK_RW_SYNC) {
ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
ad->new_batch = 0;
}
ad->batch_data_dir = BLK_RW_ASYNC;
- ad->current_write_count = ad->write_batch_count;
- ad->write_batch_idled = 0;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->current_write_count = asq->write_batch_count;
+ asq->write_batch_idled = 0;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
goto dispatch_request;
}
@@ -1150,9 +1162,9 @@ dispatch_request:
* If a request has expired, service it.
*/
- if (as_fifo_expired(ad, ad->batch_data_dir)) {
+ if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
fifo_expired:
- rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+ rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
}
if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
int data_dir;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);
RQ_SET_STATE(rq, AS_RQ_NEW);
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
as_update_rq(ad, rq); /* keep state machine up to date */
RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
}
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
- struct as_data *ad = q->elevator->elevator_data;
-
- return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
- && list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
static int
as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
- struct as_data *ad = q->elevator->elevator_data;
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
+ struct as_queue *asq = elv_get_sched_queue_current(q);
+
+ if (!asq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
*/
- __rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+ __rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
if (__rq && elv_rq_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
return ret;
}
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
+{
+ struct as_queue *asq;
+ struct as_data *ad = eq->elevator_data;
+
+ asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+ if (asq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+ asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+ asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+ if (ad)
+ asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+ else
+ asq->write_batch_count = default_write_batch_expire / 10;
+
+ if (asq->write_batch_count < 2)
+ asq->write_batch_count = 2;
+out:
+ return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+ struct as_queue *asq = sched_queue;
+
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+ kfree(asq);
+}
+
static void as_exit_queue(struct elevator_queue *e)
{
struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
del_timer_sync(&ad->antic_timer);
cancel_work_sync(&ad->antic_work);
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
put_io_context(ad->io_context);
kfree(ad);
}
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
init_timer(&ad->antic_timer);
INIT_WORK(&ad->antic_work, as_work_handler);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
- ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
- ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
- ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
- if (ad->write_batch_count < 2)
- ad->write_batch_count = 2;
return ad;
}
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
.elevator_add_req_fn = as_add_request,
.elevator_activate_req_fn = as_activate_request,
.elevator_deactivate_req_fn = as_deactivate_request,
- .elevator_queue_empty_fn = as_queue_empty,
.elevator_completed_req_fn = as_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
.elevator_init_fn = as_init_queue,
.elevator_exit_fn = as_exit_queue,
.trim = as_trim,
+ .elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+ .elevator_free_sched_queue_fn = as_free_as_queue,
},
.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
-struct deadline_data {
- /*
- * run time data
- */
-
+struct deadline_queue {
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
- struct rb_root sort_list[2];
+ struct rb_root sort_list[2];
struct list_head fifo_list[2];
-
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
- sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
+};
+struct deadline_data {
+ struct request_queue *q;
+ sector_t last_sector; /* head position */
/*
* settings that change how the i/o scheduler behaves
*/
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
static inline struct rb_root *
deadline_rb_root(struct deadline_data *dd, struct request *rq)
{
- return &dd->sort_list[rq_data_dir(rq)];
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+ return &dq->sort_list[rq_data_dir(rq)];
}
/*
@@ -87,9 +87,10 @@ static inline void
deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- if (dd->next_rq[data_dir] == rq)
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ if (dq->next_rq[data_dir] == rq)
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
elv_rb_del(deadline_rb_root(dd, rq), rq);
}
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(q, rq);
deadline_add_rq_rb(dd, rq);
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
}
/*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
struct deadline_data *dd = q->elevator->elevator_data;
struct request *__rq;
int ret;
+ struct deadline_queue *dq;
+
+ dq = elv_get_sched_queue_current(q);
+ if (!dq)
+ return ELEVATOR_NO_MERGE;
/*
* check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
if (dd->front_merges) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
- __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ __rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
if (__rq) {
BUG_ON(sector != __rq->sector);
@@ -207,10 +214,11 @@ static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
- dd->next_rq[READ] = NULL;
- dd->next_rq[WRITE] = NULL;
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ dq->next_rq[READ] = NULL;
+ dq->next_rq[WRITE] = NULL;
+ dq->next_rq[data_dir] = deadline_latter_request(rq);
dd->last_sector = rq_end_sector(rq);
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
*/
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
{
- struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+ struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
/*
* rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int reads = !list_empty(&dd->fifo_list[READ]);
- const int writes = !list_empty(&dd->fifo_list[WRITE]);
+ struct deadline_queue *dq = elv_select_sched_queue(q, force);
+ int reads, writes;
struct request *rq;
int data_dir;
+ if (!dq)
+ return 0;
+
+ reads = !list_empty(&dq->fifo_list[READ]);
+ writes = !list_empty(&dq->fifo_list[WRITE]);
+
/*
* batches are currently reads XOR writes
*/
- if (dd->next_rq[WRITE])
- rq = dd->next_rq[WRITE];
+ if (dq->next_rq[WRITE])
+ rq = dq->next_rq[WRITE];
else
- rq = dd->next_rq[READ];
+ rq = dq->next_rq[READ];
- if (rq && dd->batching < dd->fifo_batch)
+ if (rq && dq->batching < dd->fifo_batch)
/* we have a next request and are still entitled to batch */
goto dispatch_request;
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
- if (writes && (dd->starved++ >= dd->writes_starved))
+ if (writes && (dq->starved++ >= dd->writes_starved))
goto dispatch_writes;
data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
- dd->starved = 0;
+ dq->starved = 0;
data_dir = WRITE;
@@ -299,48 +313,62 @@ dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
- if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
- rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
- rq = dd->next_rq[data_dir];
+ rq = dq->next_rq[data_dir];
}
- dd->batching = 0;
+ dq->batching = 0;
dispatch_request:
/*
* rq is the selected appropriate request.
*/
- dd->batching++;
+ dq->batching++;
deadline_move_request(dd, rq);
return 1;
}
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct deadline_data *dd = q->elevator->elevator_data;
+ struct deadline_queue *dq;
- return list_empty(&dd->fifo_list[WRITE])
- && list_empty(&dd->fifo_list[READ]);
+ dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+ if (dq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&dq->fifo_list[READ]);
+ INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+ dq->sort_list[READ] = RB_ROOT;
+ dq->sort_list[WRITE] = RB_ROOT;
+out:
+ return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+ void *sched_queue)
+{
+ struct deadline_queue *dq = sched_queue;
+
+ kfree(dq);
}
static void deadline_exit_queue(struct elevator_queue *e)
{
struct deadline_data *dd = e->elevator_data;
- BUG_ON(!list_empty(&dd->fifo_list[READ]));
- BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
kfree(dd);
}
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
if (!dd)
return NULL;
- INIT_LIST_HEAD(&dd->fifo_list[READ]);
- INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
- dd->sort_list[READ] = RB_ROOT;
- dd->sort_list[WRITE] = RB_ROOT;
+ dd->q = q;
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
.elevator_merge_req_fn = deadline_merged_requests,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn = deadline_add_request,
- .elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_init_fn = deadline_init_queue,
.elevator_exit_fn = deadline_exit_queue,
+ .elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+ .elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
-
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
return e;
}
-static void *elevator_init_queue(struct request_queue *q,
- struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+ struct elevator_queue *eq)
{
- return eq->ops->elevator_init_fn(q);
+ void *data = NULL;
+
+ if (eq->ops->elevator_init_fn) {
+ data = eq->ops->elevator_init_fn(q);
+ if (data)
+ return data;
+ else
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* IO scheduler does not instantiate data (noop), it is not an error */
+ return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+ void *sched_queue)
+{
+ /* Not all io schedulers (cfq) store sched_queue */
+ if (!sched_queue)
+ return;
+ eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+ struct elevator_queue *eq)
+{
+ void *sched_queue = NULL;
+
+ if (eq->ops->elevator_alloc_sched_queue_fn) {
+ sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+ GFP_KERNEL);
+ if (!sched_queue)
+ return ERR_PTR(-ENOMEM);
+
+ }
+
+ return sched_queue;
}
static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
- void *data)
+ void *data, void *sched_queue)
{
q->elevator = eq;
eq->elevator_data = data;
+ eq->sched_queue = sched_queue;
}
static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
struct elevator_type *e = NULL;
struct elevator_queue *eq;
int ret = 0;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
INIT_LIST_HEAD(&q->queue_head);
q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
if (!eq)
return -ENOMEM;
- data = elevator_init_queue(q, eq);
- if (!data) {
+ data = elevator_init_data(q, eq);
+
+ if (IS_ERR(data)) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, eq);
+
+ if (IS_ERR(sched_queue)) {
kobject_put(&eq->kobj);
return -ENOMEM;
}
- elevator_attach(q, eq, data);
+ elevator_attach(q, eq, data, sched_queue);
return ret;
}
EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elevator_free_sched_queue(e, e->sched_queue);
elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old_elevator, *e;
- void *data;
+ void *data = NULL, *sched_queue = NULL;
/*
* Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (!e)
return 0;
- data = elevator_init_queue(q, e);
- if (!data) {
+ data = elevator_init_data(q, e);
+
+ if (IS_ERR(data)) {
kobject_put(&e->kobj);
- return 0;
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, e);
+
+ if (IS_ERR(sched_queue)) {
+ kobject_put(&e->kobj);
+ return -ENOMEM;
}
/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
/*
* attach and start new elevator
*/
- elevator_attach(q, e, data);
+ elevator_attach(q, e, data, sched_queue);
spin_unlock_irq(q->queue_lock);
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
}
EXPORT_SYMBOL(elv_rb_latter_request);
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
{
- return ioq_sched_queue(rq_ioq(rq));
+ /*
+ * io scheduler is not using fair queuing. Return sched_queue
+ * pointer stored in elevator_queue. It will be null if io
+ * scheduler never stored anything there to begin with (cfq)
+ */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ /*
+ * IO scheduler is using the fair queuing infrastructure. If the io
+ * scheduler has passed a non-null rq, retrieve the sched_queue
+ * pointer from there. */
+ if (rq)
+ return ioq_sched_queue(rq_ioq(rq));
+
+ return NULL;
}
EXPORT_SYMBOL(elv_get_sched_queue);
/* Select an ioscheduler queue to dispatch request from. */
void *elv_select_sched_queue(struct request_queue *q, int force)
{
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
return ioq_sched_queue(elv_fq_select_ioq(q, force));
}
EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+ return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
#include <linux/module.h>
#include <linux/init.h>
-struct noop_data {
+struct noop_queue {
struct list_head queue;
};
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
static int noop_dispatch(struct request_queue *q, int force)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_select_sched_queue(q, force);
- if (!list_empty(&nd->queue)) {
+ if (!nq)
+ return 0;
+
+ if (!list_empty(&nq->queue)) {
struct request *rq;
- rq = list_entry(nd->queue.next, struct request, queuelist);
+ rq = list_entry(nq->queue.next, struct request, queuelist);
list_del_init(&rq->queuelist);
elv_dispatch_sort(q, rq);
return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
static void noop_add_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
- struct noop_data *nd = q->elevator->elevator_data;
-
- return list_empty(&nd->queue);
+ list_add_tail(&rq->queuelist, &nq->queue);
}
static struct request *
noop_former_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.prev == &nd->queue)
+ if (rq->queuelist.prev == &nq->queue)
return NULL;
return list_entry(rq->queuelist.prev, struct request, queuelist);
}
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
static struct request *
noop_latter_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);
- if (rq->queuelist.next == &nd->queue)
+ if (rq->queuelist.next == &nq->queue)
return NULL;
return list_entry(rq->queuelist.next, struct request, queuelist);
}
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct noop_data *nd;
+ struct noop_queue *nq;
- nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
- if (!nd)
- return NULL;
- INIT_LIST_HEAD(&nd->queue);
- return nd;
+ nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+ if (nq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&nq->queue);
+out:
+ return nq;
}
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
{
- struct noop_data *nd = e->elevator_data;
+ struct noop_queue *nq = sched_queue;
- BUG_ON(!list_empty(&nd->queue));
- kfree(nd);
+ kfree(nq);
}
static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
.elevator_merge_req_fn = noop_merged_requests,
.elevator_dispatch_fn = noop_dispatch,
.elevator_add_req_fn = noop_add_request,
- .elevator_queue_empty_fn = noop_queue_empty,
.elevator_former_req_fn = noop_former_request,
.elevator_latter_req_fn = noop_latter_request,
- .elevator_init_fn = noop_init_queue,
- .elevator_exit_fn = noop_exit_queue,
+ .elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
+ .elevator_free_sched_queue_fn = noop_free_noop_queue,
},
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
@@ -112,6 +114,7 @@ struct elevator_queue
{
struct elevator_ops *ops;
void *elevator_data;
+ void *sched_queue;
struct kobject kobj;
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 10/18] io-controller: Prepare elevator layer for single queue schedulers
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
` (36 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
The elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it, and now it is time to do the groundwork for
noop, deadline and AS.
noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup where requests from all the processes in the cgroup
are queued.
Generally the io scheduler takes care of creating queues. Because there is
only one queue here, the common layer has been modified to take care of queue
creation and some other functionality. This special casing helps keep the
changes to noop, deadline and AS to a minimum (see the sketch below).
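The idea is easiest to see in stripped-down form. The sketch below is a
user-space model only; the struct and function names are made up for
illustration, and the real lookup/creation path is elv_fq_set_request_ioq()
in the diff that follows.

#include <stdlib.h>

struct model_ioq {
	int ref;		/* references held on this queue */
	void *sched_queue;	/* io scheduler's private queue */
};

struct model_iog {		/* one io group per cgroup */
	struct model_ioq *ioq;	/* the group's single io queue */
};

/*
 * Look up the group's queue, creating it on first use, and take a
 * per-request reference on it.
 */
static struct model_ioq *model_set_request_ioq(struct model_iog *iog)
{
	if (!iog->ioq) {
		struct model_ioq *ioq = calloc(1, sizeof(*ioq));

		if (!ioq)
			return NULL;
		ioq->ref = 1;		/* reference held by the group */
		iog->ioq = ioq;
	}
	iog->ioq->ref++;		/* reference held by this request */
	return iog->ioq;
}

int main(void)
{
	struct model_iog group = { .ioq = NULL };

	struct model_ioq *a = model_set_request_ioq(&group); /* creates it */
	struct model_ioq *b = model_set_request_ioq(&group); /* reuses it */

	return (a && a == b && a->ref == 3) ? 0 : 1;
}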
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/elevator-fq.c | 160 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 67 +++++++++++++++++++
block/elevator.c | 35 ++++++++++-
include/linux/elevator.h | 14 ++++
4 files changed, 274 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index ec01273..f2805e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -915,6 +915,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
/* Free up async idle queue */
elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* Optimization for io schedulers having single ioq */
+ if (elv_iosched_single_ioq(e))
+ elv_release_ioq(e, &iog->ioq);
+#endif
}
@@ -1702,6 +1708,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
rq->iog = iog;
}
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a
+ * single io queue per cgroup. In this case the common layer can maintain
+ * a pointer in the group data structure and keep track of it.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask)
+{
+ struct elevator_queue *e = q->elevator;
+ unsigned long flags;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
+ struct io_group *iog;
+ void *sched_q = NULL, *new_sched_q = NULL;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ /* Determine the io group request belongs to */
+ iog = rq->iog;
+ BUG_ON(!iog);
+
+retry:
+ /* Get the iosched queue */
+ ioq = io_group_ioq(iog);
+ if (!ioq) {
+ /* io queue and sched_queue needs to be allocated */
+ BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+ if (new_sched_q) {
+ goto alloc_ioq;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ /* Call the io scheduler to create the scheduler queue */
+ new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+ e, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+ gfp_mask | __GFP_ZERO);
+ if (!sched_q)
+ goto queue_fail;
+ }
+
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ sched_q = new_sched_q;
+ new_sched_q = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ e->ops->elevator_free_sched_queue_fn(e,
+ sched_q);
+ sched_q = NULL;
+ goto queue_fail;
+ }
+ }
+
+ elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ io_group_set_ioq(iog, ioq);
+ elv_mark_ioq_sync(ioq);
+ }
+
+ if (new_sched_q)
+ e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
+ /* Request reference */
+ elv_get_ioq(ioq);
+ rq->ioq = ioq;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 0;
+
+queue_fail:
+ WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+ elv_schedule_dispatch(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ struct io_group *iog;
+
+ /* Determine the io group and io queue of the bio submitting task */
+ iog = io_lookup_io_group_current(q);
+ if (!iog) {
+ /* Maybe the task belongs to a cgroup for which the io group
+ * has not been set up yet. */
+ return NULL;
+ }
+ return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ if (ioq) {
+ rq->ioq = NULL;
+ elv_put_ioq(ioq);
+ }
+}
+
#else /* GROUP_IOSCHED */
void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
{
@@ -2143,7 +2296,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->efqd = efqd;
elv_ioq_set_ioprio_class(ioq, ioprio_class);
elv_ioq_set_ioprio(ioq, ioprio);
- ioq->pid = current->pid;
+
+ if (elv_iosched_single_ioq(eq))
+ ioq->pid = 0;
+ else
+ ioq->pid = current->pid;
+
ioq->sched_queue = sched_queue;
if (is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 7d3434b..5a15329 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -236,6 +236,9 @@ struct io_group {
/* async_queue and idle_queue are used only for cfq */
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
+
+ /* Single ioq per group, used for noop, deadline, anticipatory */
+ struct io_queue *ioq;
};
/**
@@ -507,6 +510,28 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return iog->entity.weight;
}
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ BUG_ON(!iog);
+ return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+ BUG_ON(!iog);
+ /* io group reference. Will be dropped when group is destroyed. */
+ elv_get_ioq(ioq);
+ iog->ioq = ioq;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -538,6 +563,32 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return 0;
}
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -655,5 +706,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
}
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index f6725f2..e634a2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -211,6 +211,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
{
void *sched_queue = NULL;
+ /*
+ * If fair queuing is enabled, then queue allocation takes place
+ * during set_request() functions when request actually comes
+ * in.
+ */
+ if (elv_iosched_fair_queuing_enabled(eq))
+ return NULL;
+
if (eq->ops->elevator_alloc_sched_queue_fn) {
sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
GFP_KERNEL);
@@ -965,6 +973,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
elv_fq_set_request_io_group(q, rq);
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e))
+ return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
@@ -976,6 +991,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e)) {
+ elv_fq_unset_request_ioq(q, rq);
+ return;
+ }
+
if (e->ops->elevator_put_req_fn)
e->ops->elevator_put_req_fn(rq);
}
@@ -1347,9 +1371,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
/*
* Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and
+ * retrieve the ioq pointer from it. This is used only by single queue
+ * ioschedulers, to retrieve the queue associated with the group and
+ * decide whether the new bio can do a front merge or not.
*/
void *elv_get_sched_queue_current(struct request_queue *q)
{
- return q->elevator->sched_queue;
+ /* Fair queuing is not enabled. There is only one queue. */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ return ioq_sched_queue(elv_lookup_ioq_current(q));
}
EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3729a2f..ee38d08 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -249,17 +249,31 @@ enum {
/* iosched wants to use fq logic of elevator layer */
#define ELV_IOSCHED_NEED_FQ 1
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ 2
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
}
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return 0;
}
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 10/18] io-controller: Prepare elevator layer for single queue schedulers
@ 2009-05-05 19:58 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda
Cc: vgoyal, akpm
Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do the groundwork for
noop, deadline and AS.
noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup, where requests from all the processes in the cgroup
will be queued.
Generally the io scheduler takes care of creating queues. Because there is
only one queue here, we have modified the common layer to take care of queue
creation and some other functionality. This special casing helps keep the
changes to noop, deadline and AS to a minimum.
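A rough sketch of the resulting request setup path (simplified from the
elevator.c and elevator-fq.c hunks below; locking, reference counting and
error handling omitted):

    int elv_set_request(struct request_queue *q, struct request *rq,
                        gfp_t gfp_mask)
    {
            struct elevator_queue *e = q->elevator;

            /* find/create the io group this request belongs to */
            elv_fq_set_request_io_group(q, rq);

            /*
             * noop, deadline and AS advertise ELV_IOSCHED_SINGLE_IOQ, so the
             * common layer looks up (or allocates) the single ioq cached in
             * the io group itself.
             */
            if (elv_iosched_single_ioq(e))
                    return elv_fq_set_request_ioq(q, rq, gfp_mask);

            /* cfq-style schedulers pick a per-process queue themselves */
            if (e->ops->elevator_set_req_fn)
                    return e->ops->elevator_set_req_fn(q, rq, gfp_mask);

            return 0;
    }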
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 160 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 67 +++++++++++++++++++
block/elevator.c | 35 ++++++++++-
include/linux/elevator.h | 14 ++++
4 files changed, 274 insertions(+), 2 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index ec01273..f2805e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -915,6 +915,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
/* Free up async idle queue */
elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* Optimization for io schedulers having single ioq */
+ if (elv_iosched_single_ioq(e))
+ elv_release_ioq(e, &iog->ioq);
+#endif
}
@@ -1702,6 +1708,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
rq->iog = iog;
}
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a single
+ * io queue per cgroup. In this case the common layer can just maintain a
+ * pointer in the group data structure and keep track of it.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue of request based on process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask)
+{
+ struct elevator_queue *e = q->elevator;
+ unsigned long flags;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
+ struct io_group *iog;
+ void *sched_q = NULL, *new_sched_q = NULL;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ /* Determine the io group request belongs to */
+ iog = rq->iog;
+ BUG_ON(!iog);
+
+retry:
+ /* Get the iosched queue */
+ ioq = io_group_ioq(iog);
+ if (!ioq) {
+ /* io queue and sched_queue need to be allocated */
+ BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+ if (new_sched_q) {
+ goto alloc_ioq;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ /* Call io scheduler to create scheduler queue */
+ new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+ e, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+ gfp_mask | __GFP_ZERO);
+ if (!sched_q)
+ goto queue_fail;
+ }
+
+alloc_ioq:
+ if (new_ioq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ sched_q = new_sched_q;
+ new_sched_q = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq) {
+ e->ops->elevator_free_sched_queue_fn(e,
+ sched_q);
+ sched_q = NULL;
+ goto queue_fail;
+ }
+ }
+
+ elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ io_group_set_ioq(iog, ioq);
+ elv_mark_ioq_sync(ioq);
+ }
+
+ if (new_sched_q)
+ e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
+ /* Request reference */
+ elv_get_ioq(ioq);
+ rq->ioq = ioq;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 0;
+
+queue_fail:
+ WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+ elv_schedule_dispatch(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ struct io_group *iog;
+
+ /* Determine the io group and io queue of the bio submitting task */
+ iog = io_lookup_io_group_current(q);
+ if (!iog) {
+ /* Maybe the task belongs to a cgroup for which the io group has
+ * not been set up yet. */
+ return NULL;
+ }
+ return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ if (ioq) {
+ rq->ioq = NULL;
+ elv_put_ioq(ioq);
+ }
+}
+
#else /* GROUP_IOSCHED */
void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
{
@@ -2143,7 +2296,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->efqd = efqd;
elv_ioq_set_ioprio_class(ioq, ioprio_class);
elv_ioq_set_ioprio(ioq, ioprio);
- ioq->pid = current->pid;
+
+ if (elv_iosched_single_ioq(eq))
+ ioq->pid = 0;
+ else
+ ioq->pid = current->pid;
+
ioq->sched_queue = sched_queue;
if (is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 7d3434b..5a15329 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -236,6 +236,9 @@ struct io_group {
/* async_queue and idle_queue are used only for cfq */
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
+
+ /* Single ioq per group, used for noop, deadline, anticipatory */
+ struct io_queue *ioq;
};
/**
@@ -507,6 +510,28 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return iog->entity.weight;
}
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ BUG_ON(!iog);
+ return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+ BUG_ON(!iog);
+ /* io group reference. Will be dropped when group is destroyed. */
+ elv_get_ioq(ioq);
+ iog->ioq = ioq;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -538,6 +563,32 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
return 0;
}
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+ return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -655,5 +706,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
}
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+ struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index f6725f2..e634a2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -211,6 +211,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
{
void *sched_queue = NULL;
+ /*
+ * If fair queuing is enabled, then queue allocation takes place
+ * during set_request() functions when request actually comes
+ * in.
+ */
+ if (elv_iosched_fair_queuing_enabled(eq))
+ return NULL;
+
if (eq->ops->elevator_alloc_sched_queue_fn) {
sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
GFP_KERNEL);
@@ -965,6 +973,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
elv_fq_set_request_io_group(q, rq);
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e))
+ return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
@@ -976,6 +991,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e)) {
+ elv_fq_unset_request_ioq(q, rq);
+ return;
+ }
+
if (e->ops->elevator_put_req_fn)
e->ops->elevator_put_req_fn(rq);
}
@@ -1347,9 +1371,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
/*
* Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
*/
void *elv_get_sched_queue_current(struct request_queue *q)
{
- return q->elevator->sched_queue;
+ /* Fair queuing is not enabled. There is only one queue. */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ return ioq_sched_queue(elv_lookup_ioq_current(q));
}
EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3729a2f..ee38d08 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -249,17 +249,31 @@ enum {
/* iosched wants to use fq logic of elevator layer */
#define ELV_IOSCHED_NEED_FQ 1
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ 2
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
}
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return 0;
}
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (9 preceding siblings ...)
2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
` (10 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This patch changes noop to use the queue scheduling code from the elevator
layer. One can go back to the old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 11 +++++++++++
block/noop-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
that do their own scheduling and require only minimal assistance from
the kernel.
+config IOSCHED_NOOP_HIER
+ bool "Noop Hierarchical Scheduling support"
+ depends on IOSCHED_NOOP && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in noop. In this mode noop keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_AS
tristate "Anticipatory I/O scheduler"
default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
.elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
.elevator_free_sched_queue_fn = noop_free_noop_queue,
},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
};
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 12/18] io-controller: deadline changes for hierarchical fair queuing
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (10 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
` (9 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This patch changes deadline to use the queue scheduling code from the elevator
layer. One can go back to the old deadline by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 11 +++++++++++
block/deadline-iosched.c | 3 +++
2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
a disk at any one time, its behaviour is almost identical to the
anticipatory I/O scheduler and so is a good choice.
+config IOSCHED_DEADLINE_HIER
+ bool "Deadline Hierarchical Scheduling support"
+ depends on IOSCHED_DEADLINE && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in deadline. In this mode deadline keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 13/18] io-controller: anticipatory changes for hierarchical fair queuing
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (11 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
` (8 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS behavior by deselecting
CONFIG_IOSCHED_AS_HIER.
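The central piece is a new expire handshake: before switching queues, the
elevator fair queuing layer asks the io scheduler for permission through the
new elevator_expire_ioq_fn hook, and AS can veto the switch. A rough sketch
of the AS side decision (condensed from the as-iosched.c hunk below; the
batch context save/restore that accompanies an expiry is omitted):

    static int as_expire_ioq(struct request_queue *q, void *sched_queue,
                             int slice_expired, int force)
    {
            struct as_data *ad = q->elevator->elevator_data;

            if (force)
                    return 1;       /* forced expiry, no choice */

            /* requests from the previous batch are still in flight */
            if (ad->changed_batch || ad->nr_dispatched)
                    return 0;

            /* still anticipating and the time slice has not expired */
            if ((ad->antic_status == ANTIC_WAIT_REQ ||
                 ad->antic_status == ANTIC_WAIT_NEXT) && !slice_expired)
                    return 0;

            return 1;               /* ok to expire this queue */
    }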
TODO/Issues
===========
- AS anticipation logic does not seem to be sufficient to provide a BW difference
if two "dd" processes are running in two different cgroups. Needs to be looked into.
- AS adjustment of the write batch request count happens upon every W->R batch
direction switch. This automatic adjustment depends on how much time a
read is taking after a W->R switch.
This does not gel very well when hierarchical scheduling is enabled and
every io group can have its separate read/write batch. Now if io group
switching takes place, it creates issues.
Currently I have disabled write batch length adjustment in hierarchical
mode.
- Currently performance seems to be very bad in hierarchical mode. Needs
to be looked into.
- I think the whole idea of the common layer doing time slice switching between
queues and then the queue in turn running timed batches is not very good. Maybe
AS can maintain two queues (one for READS and the other for WRITES) and let
the common layer do the time slice switching between these two queues.
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 12 +++
block/as-iosched.c | 177 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.c | 76 ++++++++++++++++----
include/linux/elevator.h | 16 ++++
4 files changed, 266 insertions(+), 15 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
deadline I/O scheduler, it can also be slower in some cases
especially some database loads.
+config IOSCHED_AS_HIER
+ bool "Anticipatory Hierarchical Scheduling support"
+ depends on IOSCHED_AS && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in anticipatory. In this mode
+ anticipatory keeps one IO queue per cgroup instead of a global
+ queue. Elevator fair queuing logic ensures fairness among various
+ queues.
+
config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7158e13..12aea88 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -84,6 +84,19 @@ struct as_queue {
struct list_head fifo_list[2];
struct request *next_rq[2]; /* next in sort order */
+
+ /*
+ * If an as_queue is switched while a batch is running, then we
+ * store the time left before current batch will expire
+ */
+ long current_batch_time_left;
+
+ /*
+ * batch data dir when queue was scheduled out. This will be used
+ * to setup ad->batch_data_dir when queue is scheduled in.
+ */
+ int saved_batch_data_dir;
+
unsigned long last_check_fifo[2];
int write_batch_count; /* max # of reqs in a write batch */
int current_write_count; /* how many requests left this batch */
@@ -150,6 +163,141 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Save batch data dir */
+ asq->saved_batch_data_dir = ad->batch_data_dir;
+
+ if (ad->changed_batch) {
+ /*
+ * In case of force expire, we come here. Batch changeover
+ * has been signalled but we are waiting for all the
+ * requests from the previous batch to finish before starting
+ * the new batch. Can't wait now. Mark that full batch time
+ * needs to be allocated when this queue is scheduled again.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->changed_batch = 0;
+ return;
+ }
+
+ if (ad->new_batch) {
+ /*
+ * We should come here only when new_batch has been set
+ * but no read request has been issued or if it is a forced
+ * expiry.
+ *
+ * In both the cases, new batch has not started yet so
+ * allocate full batch length for next scheduling opportunity.
+ * We don't do write batch size adjustment in hierarchical
+ * AS so that should not be an issue.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->new_batch = 0;
+ return;
+ }
+
+ /* Save how much time is left before current batch expires */
+ if (as_batch_expired(ad, asq))
+ asq->current_batch_time_left = 0;
+ else {
+ asq->current_batch_time_left = ad->current_batch_expires
+ - jiffies;
+ BUG_ON((asq->current_batch_time_left) < 0);
+ }
+}
+
+/*
+ * FIXME: In original AS, read batch's time account started only after when
+ * first request had completed (if last batch was a write batch). But here
+ * we might be rescheduling a read batch right away irrespective of the fact
+ * of disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Adjust the batch expire time */
+ if (asq->current_batch_time_left)
+ ad->current_batch_expires = jiffies +
+ asq->current_batch_time_left;
+ /* restore asq batch_data_dir info */
+ ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
+{
+ struct as_queue *asq = sched_queue;
+ struct as_data *ad = q->elevator->elevator_data;
+
+ as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+ int slice_expired, int force)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+ int status = ad->antic_status;
+ struct as_queue *asq = sched_queue;
+
+ /* Forced expiry. We don't have a choice */
+ if (force) {
+ as_antic_stop(ad);
+ as_save_batch_context(ad, asq);
+ return 1;
+ }
+
+ /*
+ * We are waiting for requests to finish from last
+ * batch. Don't expire the queue now
+ */
+ if (ad->changed_batch)
+ goto keep_queue;
+
+ /*
+ * Wait for all requests from existing batch to finish before we
+ * switch the queue. New queue might change the batch direction
+ * and this is to be consistent with AS philosophy of not dispatching
+ * new requests to the underlying drive till requests from the
+ * previous batch are completed.
+ */
+ if (ad->nr_dispatched)
+ goto keep_queue;
+
+ /*
+ * If AS anticipation is ON, stop it if slice expired, otherwise
+ * keep the queue.
+ */
+ if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+ if (slice_expired)
+ as_antic_stop(ad);
+ else
+ /*
+ * We are anticipating and time slice has not expired
+ * so I would rather prefer waiting than break the
+ * anticipation and expire the queue.
+ */
+ goto keep_queue;
+ }
+
+ /* We are good to expire the queue. Save batch context */
+ as_save_batch_context(ad, asq);
+ return 1;
+
+keep_queue:
+ return 0;
+}
+#endif
/*
* IO Context helper functions
@@ -805,6 +953,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
}
}
+#ifndef CONFIG_IOSCHED_AS_HIER
/*
* Gathers timings and resizes the write batch automatically
*/
@@ -833,6 +982,7 @@ static void update_write_batch(struct as_data *ad)
if (asq->write_batch_count < 1)
asq->write_batch_count = 1;
}
+#endif /* !CONFIG_IOSCHED_AS_HIER */
/*
* as_completed_request is to be called when a request has completed and
@@ -867,7 +1017,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
* and writeback caches
*/
if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+ /*
+ * Dynamic adjustment of the write batch length is disabled
+ * for hierarchical scheduling. It is difficult to do
+ * accurate accounting when queue switch can take place
+ * in the middle of the batch.
+ *
+ * Say, A, B are two groups. Following is the sequence of
+ * events.
+ *
+ * Servicing Write batch of A.
+ * Queue switch takes place and write batch of B starts.
+ * Batch switch takes place and read batch of B starts.
+ *
+ * In above scenario, writes issued in write batch of A
+ * might impact the write batch length of B, which is not
+ * good.
+ */
update_write_batch(ad);
+#endif
ad->current_batch_expires = jiffies +
ad->batch_expire[BLK_RW_SYNC];
ad->new_batch = 0;
@@ -1516,8 +1685,14 @@ static struct elevator_type iosched_as = {
.trim = as_trim,
.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+ .elevator_expire_ioq_fn = as_expire_ioq,
+ .elevator_active_ioq_set_fn = as_active_ioq_set,
},
-
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+ },
+#endif
.elevator_attrs = as_attrs,
.elevator_name = "anticipatory",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index f2805e6..02c27ac 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,6 +36,8 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force);
static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
unsigned short prio)
@@ -2230,6 +2232,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
int old_idle, enable_idle;
struct elv_fq_data *efqd = ioq->efqd;
+ /* If idling is disabled from ioscheduler, return */
+ if (!elv_gen_idling_enabled(eq))
+ return;
/*
* Don't idle for async or idle io prio class
*/
@@ -2303,7 +2308,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
ioq->pid = current->pid;
ioq->sched_queue = sched_queue;
- if (is_sync && !elv_ioq_class_idle(ioq))
+ if (elv_gen_idling_enabled(eq) && is_sync && !elv_ioq_class_idle(ioq))
elv_mark_ioq_idle_window(ioq);
bfq_init_entity(&ioq->entity, iog);
ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -2718,16 +2723,18 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
{
elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 0, 1)) {
+ elv_ioq_slice_expired(q);
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
+ /*
+ * Put the new queue at the front of the of the current list,
+ * so we know that it will be selected next.
+ */
- elv_activate_ioq(ioq, 1);
- elv_ioq_set_slice_end(ioq, 0);
- elv_mark_ioq_slice_new(ioq);
+ elv_activate_ioq(ioq, 1);
+ elv_ioq_set_slice_end(ioq, 0);
+ elv_mark_ioq_slice_new(ioq);
+ }
}
void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2906,11 +2913,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
elv_deactivate_ioq(efqd, ioq, 0);
}
+/*
+ * Notify the iosched that the elevator wants to expire the queue. This gives
+ * an iosched like AS a chance to say no (if it is in the middle of a batch
+ * changeover or it is anticipating). It also allows the iosched to do some
+ * housekeeping.
+ *
+ * force--> this is a forced dispatch and the iosched must clean up its
+ * state. This is useful when the elevator wants to drain the
+ * iosched and expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice expired, hence the elevator fair
+ * queuing logic wants to switch the queue. The iosched should
+ * allow that unless it really needs to keep the queue.
+ * Currently AS can deny the switch if it is in the middle of
+ * a batch switch.
+ *
+ * if 0, the time slice is still remaining. It is up to the
+ * iosched whether it wants to wait on this queue or expire it
+ * and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+ int force)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (e->ops->elevator_expire_ioq_fn)
+ return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+ slice_expired, force);
+
+ return 1;
+}
+
/* Common layer function to select the next queue to dispatch from */
void *elv_fq_select_ioq(struct request_queue *q, int force)
{
struct elv_fq_data *efqd = &q->elevator->efqd;
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+ int slice_expired = 1;
if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2984,8 +3024,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
goto keep_queue;
}
+ slice_expired = 0;
expire:
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, slice_expired, force))
+ elv_ioq_slice_expired(q);
+ else {
+ ioq = NULL;
+ goto keep_queue;
+ }
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
@@ -3146,7 +3192,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
}
if (elv_ioq_class_idle(ioq)) {
- elv_ioq_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
goto done;
}
@@ -3170,9 +3217,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq))
- elv_ioq_slice_expired(q);
- else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+ if (elv_ioq_slice_used(ioq)) {
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_ioq_slice_expired(q);
+ } else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
&& sync && !rq_noidle(rq))
elv_ioq_arm_slice_timer(q, 0);
}
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ee38d08..cbfce0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
struct request*);
typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
#endif
struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
elevator_should_preempt_fn *elevator_should_preempt_fn;
elevator_update_idle_window_fn *elevator_update_idle_window_fn;
elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+ elevator_expire_ioq_fn *elevator_expire_ioq_fn;
#endif
};
@@ -252,6 +254,9 @@ enum {
/* iosched maintains only single ioq per group.*/
#define ELV_IOSCHED_SINGLE_IOQ 2
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE 4
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -262,6 +267,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
}
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -274,6 +285,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
return 0;
}
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios.
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (12 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
` (7 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o blkio_cgroup patches from Ryo to track async bios.
o Fernando is also working on another IO tracking mechanism. We are not
particular about any IO tracking mechanism. This patchset can make use
of any mechanism that makes it upstream. For the time being, we are making
use of Ryo's posting.
Based on 2.6.30-rc3-git3
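As a rough, hypothetical illustration (not part of this patch; the helper
name charge_bio_to_page_owner() is made up), a consumer such as an io
scheduler or dm-ioband could attribute an async bio to the io_context of the
cgroup that owns the page like this:

    static void charge_bio_to_page_owner(struct bio *bio)
    {
            struct io_context *ioc;

            /*
             * Falls back to the default blkio cgroup's io_context if the
             * page is untracked; takes a reference which we must drop.
             */
            ioc = get_blkio_cgroup_iocontext(bio);
            if (!ioc)
                    return;

            /* ... account the bio against ioc here ... */

            put_io_context(ioc);
    }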
Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
---
block/blk-ioc.c | 37 +++---
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 97 +++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 31 ++++-
init/Kconfig | 15 ++
mm/Makefile | 4 +-
mm/biotrack.c | 300 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 17 ++-
mm/swap_state.c | 2 +
19 files changed, 511 insertions(+), 30 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8f0f6cf..ccde40e 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,32 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ioc->cgroup_changed = 0;
+#endif
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
- ret->cgroup_changed = 0;
-#endif
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..79118d4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..60b1a99 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);
while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..741a8b5
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_cgroup_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else /* CONFIG_CGROUP_BIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 68ea6bd..f214e6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 51664bb..ed52a1f 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..dd7f71c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +73,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;
static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -122,4 +124,27 @@ static inline void swap_cgroup_swapoff(int type)
}
#endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT (16)
+#define PCG_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 1a4686d..ee16d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -616,6 +616,21 @@ config GROUP_IOSCHED
endif # CGROUPS
+config CGROUP_BLKIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a Resource Controller which enables tracking the owner
+ of every Block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
config MM_OWNER
bool
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..76c3436 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..2baf1f0
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+ .io_context = &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *biog;
+ struct page_cgroup *pc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0); /* 0: default blkio_cgroup id */
+ unlock_page_cgroup(pc);
+ if (!mm)
+ return;
+
+ rcu_read_lock();
+ biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog)) {
+ rcu_read_unlock();
+ return;
+ }
+ /*
+ * css_get(&biog->css) isn't called to increment the reference
+ * count of this blkio_cgroup "biog" so the css_id might turn
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default_blkio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so that it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ if (!biog)
+ return ERR_PTR(-ENOMEM);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR_PTR(-ENOMEM);
+ }
+ biog->io_context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ put_io_context(biog->io_context);
+ free_css_id(&blkio_cgroup_subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ id = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describe the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+ unsigned long id;
+
+ id = get_blkio_cgroup_id(bio);
+ rcu_read_lock();
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (css)
+ biog = container_of(css, struct blkio_cgroup, css);
+ else
+ biog = &default_blkio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ rcu_read_unlock();
+ return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id: blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css;
+
+ if (blkio_cgroup_disabled())
+ return NULL;
+
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (!css)
+ return NULL;
+ cgrp = css->cgroup;
+ return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+ unsigned long id;
+
+ rcu_read_lock();
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+ .name = "blkio",
+ .create = blkio_cgroup_create,
+ .destroy = blkio_cgroup_destroy,
+ .populate = blkio_cgroup_populate,
+ .subsys_id = blkio_cgroup_subsys_id,
+ .use_id = 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..875380c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
#include <trace/block.h>
#include <asm/tlbflush.h>
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);
if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..cee1438 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..eeefee3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -128,6 +128,12 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..194bda7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2053,6 +2054,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);
@@ -2560,6 +2563,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
set_pte_at(mm, address, page_table, entry);
/* No need to invalidate - it was non-present before */
@@ -2712,6 +2716,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..f0b6d12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..e143d04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/biotrack.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;
@@ -74,7 +75,7 @@ void __init page_cgroup_init(void)
int nid, fail;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for_each_online_node(nid) {
@@ -83,12 +84,12 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
" don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
panic("Out of memory");
}
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;
for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+ " if you don't want\n");
}
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
#include <asm/pgtable.h>
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 15/18] io-controller: map async requests to appropriate cgroup
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (13 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
` (6 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o So far we were assuming that a bio/rq belongs to the task that is submitting
it. That does not hold good in the case of async writes. This patch makes use
of the blkio_cgroup patches to attribute async writes to the right group
instead of to the task submitting the bio.
o For sync requests, we continue to assume that the io belongs to the task
submitting it. Only in the case of async requests do we make use of the io
tracking patches to track the owner cgroup.
o So far cfq always caches the async queue pointer. With async requests now
not necessarily tied to the submitting task's io context, caching the
pointer will not help for async queues. This patch introduces a new config
option CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
the old behavior where the async queue pointer is cached in the task context.
If it is set, the async queue pointer is not cached and we take the help of
the bio tracking patches to determine the group a bio belongs to and then map
it to the async queue of that group (see the sketch below).
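As a rough illustration, with CONFIG_TRACK_ASYNC_CONTEXT enabled the lookup
chain for an async bio condenses to something like the sketch below. This is
not the literal patch code; locking, reference counting and error handling
are omitted:

/*
 * Condensed sketch: map an async bio to the async queue of the io group
 * that owns the page, instead of using the submitter's cached pointer.
 */
static struct cfq_queue *sketch_async_bio_to_cfqq(struct cfq_data *cfqd,
						  struct cfq_io_context *cic,
						  struct bio *bio)
{
	struct io_group *iog;

	/* The group comes from the page's blkio-cgroup id, not from current */
	iog = io_get_io_group_bio(cfqd->queue, bio, 0);
	if (!iog)
		return NULL;	/* io group not set up yet */

	/* Per-group async queue array, indexed by ioprio class and priority */
	return io_group_async_queue_prio(iog, task_ioprio_class(cic->ioc),
					 task_ioprio(cic->ioc));
}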
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 16 +++++
block/as-iosched.c | 2 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 149 ++++++++++++++++++++++++++++++++++++----------
block/deadline-iosched.c | 2 +-
block/elevator-fq.c | 131 ++++++++++++++++++++++++++++++++++-------
block/elevator-fq.h | 34 +++++++++-
block/elevator.c | 13 ++--
include/linux/elevator.h | 19 +++++-
9 files changed, 304 insertions(+), 69 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
+config TRACK_ASYNC_CONTEXT
+ bool "Determine async request context from bio"
+ depends on GROUP_IOSCHED
+ select CGROUP_BLKIO
+ default n
+ ---help---
+ Normally an async request is attributed to the task submitting the
+ request. With group ioscheduling, for accurate accounting of
+ async writes, one needs to map the request to the original task/cgroup
+ which originated the request and not the submitter of the request.
+
+ Currently there are generic io tracking patches that provide the
+ facility to map a bio to its original owner. If this option is set,
+ for async requests the original owner of the bio is determined using
+ the io tracking patches; otherwise we continue to attribute the
+ request to the submitting thread.
endmenu
endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 12aea88..afa554a 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1412,7 +1412,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
- struct as_queue *asq = elv_get_sched_queue_current(q);
+ struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
if (!asq)
return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..b19510a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -643,7 +643,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}
static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
@@ -655,7 +656,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;
if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -796,7 +797,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e9dd5b..ea71239 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -161,8 +161,8 @@ CFQ_CFQQ_FNS(coop);
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+ int, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
@@ -172,22 +172,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[!!is_sync];
}
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
- struct cfq_queue *cfqq, int is_sync)
-{
- cic->cfqq[!!is_sync] = cfqq;
-}
-
/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go in. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later we can save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
*/
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
{
- if (bio_data_dir(bio) == READ || bio_sync(bio))
- return 1;
+ struct cfq_queue *cfqq = NULL;
- return 0;
+ cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct io_group *iog;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+ if (!iog) {
+ /*
+ * Maybe this is the first rq/bio and the io group has not
+ * been set up yet.
+ */
+ return NULL;
+ }
+ return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
+ return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+ struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * Don't cache async queue pointer as now one io context might
+ * be submitting async io for various different async queues
+ */
+ if (!is_sync)
+ return;
+#endif
+ cic->cfqq[!!is_sync] = cfqq;
}
static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -505,7 +539,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);
@@ -587,7 +621,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
/*
* Disallow merge of a sync bio into an async request.
*/
- if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+ if (elv_bio_sync(bio) && !rq_is_sync(rq))
return 0;
/*
@@ -598,7 +632,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return 0;
- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq == RQ_CFQQ(rq))
return 1;
@@ -1206,14 +1240,29 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(q->queue_lock, flags);
cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
+ struct io_group *iog = io_lookup_io_group_current(q);
struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+ /*
+ * Drop the reference to old queue unconditionally. Don't
+ * worry whether new async prio queue has been allocated
+ * or not.
+ */
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue(cfqq);
+
+ /*
+ * Why to allocate new queue now? Will it not be automatically
+ * allocated whenever another async request from same context
+ * comes? Keeping it for the time being because existing cfq
+ * code allocates the new queue immediately upon prio change
+ */
+ new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
- }
+ if (new_cfqq)
+ cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
}
cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1274,7 +1323,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_IOSCHED_CFQ_HIER */
static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,6 +1335,21 @@ retry:
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+
+ /*
+ * We have not cached async queue pointer as bio tracking
+ * is enabled. Look into group async queue array using ioc
+ * class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
if (!cfqq) {
if (new_cfqq) {
goto alloc_ioq;
@@ -1348,8 +1412,9 @@ alloc_ioq:
cfqq->ioq = ioq;
cfq_init_prio_data(cfqq, ioc);
- elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
- cfqq->org_ioprio, is_sync);
+ elv_init_ioq(q->elevator, ioq, iog, cfqq,
+ cfqq->org_ioprio_class, cfqq->org_ioprio,
+ is_sync);
if (is_sync) {
if (!cfq_class_idle(cfqq))
@@ -1372,14 +1437,13 @@ out:
}
static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
if (!is_sync) {
async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1388,7 +1452,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
}
if (!cfqq) {
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
if (!cfqq)
return NULL;
}
@@ -1396,8 +1460,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
if (!is_sync && !async_cfqq)
io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
- /* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to the async queue and every time a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ * This comes from the fact that we might be getting async requests
+ * which belong to a different cgroup altogether than the cgroup
+ * iocontext belongs to. And this thread might be submitting bios
+ * from various cgroups. So every time async queue will be different
+ * based on the cgroup of the bio/rq. Can't cache the async cfqq
+ * pointer in cic.
+ */
+ if (is_sync)
+ elv_get_ioq(cfqq->ioq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
elv_get_ioq(cfqq->ioq);
+#endif
return cfqq;
}
@@ -1811,7 +1897,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
+ gfp_mask);
if (!cfqq)
goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 27b77b9..87a46c2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
int ret;
struct deadline_queue *dq;
- dq = elv_get_sched_queue_current(q);
+ dq = elv_get_sched_queue_bio(q, bio);
if (!dq)
return ELEVATOR_NO_MERGE;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 02c27ac..69eaee4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -11,6 +11,7 @@
#include <linux/blkdev.h>
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -71,6 +72,7 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
void elv_activate_ioq(struct io_queue *ioq, int add_front);
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue);
+struct io_cgroup *get_iocg_from_bio(struct bio *bio);
static int bfq_update_next_active(struct io_sched_data *sd)
{
@@ -945,6 +947,9 @@ void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &io_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
struct io_cgroup, css);
}
@@ -968,6 +973,7 @@ struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
return NULL;
}
+/* Lookup the io group of the current task */
struct io_group *io_lookup_io_group_current(struct request_queue *q)
{
struct io_group *iog;
@@ -1318,32 +1324,99 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
return iog;
}
+/* Map a bio to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_bio(struct bio *bio)
+{
+ unsigned long bio_cgroup_id;
+ struct cgroup *cgroup;
+
+ /* blk_get_request can reach here without passing a bio */
+ if (!bio)
+ return NULL;
+
+ if (bio_barrier(bio)) {
+ /*
+ * Map barrier requests to the root group. Maybe more special
+ * bio cases should come here.
+ */
+ return NULL;
+ }
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (elv_bio_sync(bio)) {
+ /* sync io. Determine cgroup from submitting task context. */
+ cgroup = task_cgroup(current, io_subsys_id);
+ return cgroup;
+ }
+
+ /* Async io. Determine the cgroup from the cgroup id stored in the page */
+ bio_cgroup_id = get_blkio_cgroup_id(bio);
+
+ if (!bio_cgroup_id)
+ return NULL;
+
+ cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+#else
+ cgroup = task_cgroup(current, io_subsys_id);
+#endif
+ return cgroup;
+}
+
+/* Determine the io cgroup of a bio */
+struct io_cgroup *get_iocg_from_bio(struct bio *bio)
+{
+ struct cgroup *cgrp;
+ struct io_cgroup *iocg = NULL;
+
+ cgrp = get_cgroup_from_bio(bio);
+ if (!cgrp)
+ return &io_root_cgroup;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ if (!iocg)
+ return &io_root_cgroup;
+
+ return iocg;
+}
+
/*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group bio belongs to.
+ * If "create" is set, io group is created if it is not already present.
*/
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
{
struct cgroup *cgroup;
struct io_group *iog;
struct elv_fq_data *efqd = &q->elevator->efqd;
rcu_read_lock();
- cgroup = task_cgroup(current, io_subsys_id);
- iog = io_find_alloc_group(q, cgroup, efqd, create);
- if (!iog) {
+ cgroup = get_cgroup_from_bio(bio);
+ if (!cgroup) {
if (create)
iog = efqd->root_group;
- else
+ else {
/*
* bio merge functions doing lookup don't want to
* map bio to root group by default
*/
iog = NULL;
+ }
+ goto out;
+ }
+
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ iog = NULL;
}
+out:
rcu_read_unlock();
return iog;
}
+EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
@@ -1678,7 +1751,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
return 1;
/* Determine the io group of the bio submitting task */
- iog = io_get_io_group(q, 0);
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
/* Maybe the task belongs to a different cgroup for which the io
* group has not been set up yet. */
@@ -1692,8 +1765,8 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
}
/* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
+ struct bio *bio)
{
struct io_group *iog;
unsigned long flags;
@@ -1702,7 +1775,7 @@ void elv_fq_set_request_io_group(struct request_queue *q,
* io group to which rq belongs. Later we should make use of
* bio cgroup patches to determine the io group */
spin_lock_irqsave(q->queue_lock, flags);
- iog = io_get_io_group(q, 1);
+ iog = io_get_io_group_bio(q, bio, 1);
spin_unlock_irqrestore(q->queue_lock, flags);
BUG_ON(!iog);
@@ -1797,7 +1870,7 @@ alloc_ioq:
}
}
- elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+ elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
}
@@ -1822,17 +1895,17 @@ queue_fail:
}
/*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue the bio belongs to. Optimization for single ioq
* per io group io schedulers.
*/
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
struct io_group *iog;
- /* Determine the io group and io queue of the bio submitting task */
- iog = io_lookup_io_group_current(q);
+ /* lookup the io group and io queue of the bio submitting task */
+ iog = io_get_io_group_bio(q, bio, 0);
if (!iog) {
- /* May be task belongs to a cgroup for which io group has
+ /* Maybe the bio belongs to a cgroup for which the io group has
* not been setup yet. */
return NULL;
}
@@ -1890,6 +1963,13 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+ int create)
+{
+ return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
@@ -1902,6 +1982,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
return q->elevator->efqd.root_group;
}
+struct io_group *rq_iog(struct request_queue *q, struct request *rq)
+{
+ return q->elevator->efqd.root_group;
+}
+
#endif /* CONFIG_GROUP_IOSCHED*/
/* Elevator fair queuing function */
@@ -2290,11 +2375,10 @@ void elv_free_ioq(struct io_queue *ioq)
EXPORT_SYMBOL(elv_free_ioq);
int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio,
- int is_sync)
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync)
{
struct elv_fq_data *efqd = &eq->efqd;
- struct io_group *iog = io_lookup_io_group_current(efqd->queue);
RB_CLEAR_NODE(&ioq->entity.rb_node);
atomic_set(&ioq->ref, 0);
@@ -3035,6 +3119,10 @@ expire:
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
+ if (ioq)
+ elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+ elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+ elv_ioq_nr_dispatched(ioq));
return ioq;
}
@@ -3166,7 +3254,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return;
- elv_log_ioq(efqd, ioq, "complete");
+ elv_log_ioq(efqd, ioq, "complete drv=%d disp=%d", efqd->rq_in_driver,
+ elv_ioq_nr_dispatched(ioq));
elv_update_hw_tag(efqd);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5a15329..5fc7d48 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -504,7 +504,7 @@ extern int io_group_allow_merge(struct request *rq, struct bio *bio);
extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq);
+ struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
return iog->entity.weight;
@@ -515,6 +515,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
extern void elv_fq_unset_request_ioq(struct request_queue *q,
struct request *rq);
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -532,6 +534,12 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
iog->ioq = ioq;
}
+static inline struct io_group *rq_iog(struct request_queue *q,
+ struct request *rq)
+{
+ return rq->iog;
+}
+
#else /* !GROUP_IOSCHED */
/*
* No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -553,7 +561,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
*/
static inline void io_disconnect_groups(struct elevator_queue *e) {}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -589,6 +597,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
+
+extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
+
#endif /* GROUP_IOSCHED */
/* Functions used by blksysfs.c */
@@ -630,7 +647,8 @@ extern void elv_put_ioq(struct io_queue *ioq);
extern void __elv_ioq_slice_expired(struct request_queue *q,
struct io_queue *ioq);
extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
- void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+ struct io_group *iog, void *sched_queue, int ioprio_class,
+ int ioprio, int is_sync);
extern void elv_schedule_dispatch(struct request_queue *q);
extern int elv_hw_tag(struct elevator_queue *e);
extern void *elv_active_sched_queue(struct elevator_queue *e);
@@ -643,6 +661,8 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
int ioprio, struct io_queue *ioq);
extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+ struct bio *bio, int create);
extern int elv_nr_busy_ioq(struct elevator_queue *e);
extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -697,7 +717,7 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
}
static inline void elv_fq_set_request_io_group(struct request_queue *q,
- struct request *rq)
+ struct request *rq, struct bio *bio)
{
}
@@ -722,5 +742,11 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
return NULL;
}
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index e634a2f..3b83b2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,11 +967,12 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
- elv_fq_set_request_io_group(q, rq);
+ elv_fq_set_request_io_group(q, rq, bio);
/*
* Optimization for noop, deadline and AS which maintain only single
@@ -1370,19 +1371,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
EXPORT_SYMBOL(elv_select_sched_queue);
/*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group bio belongs to.
*
* If fair queuing is enabled, determine the io group of task and retrieve
* the ioq pointer from that. This is used by only single queue ioschedulers
* for retrieving the queue associated with the group to decide whether the
* new bio can do a front merge or not.
*/
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
{
/* Fair queuing is not enabled. There is only one queue. */
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return q->elevator->sched_queue;
- return ioq_sched_queue(elv_lookup_ioq_current(q));
+ return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
}
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index cbfce0b..3e70d24 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -150,7 +150,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);
@@ -293,6 +294,20 @@ static inline int elv_gen_idling_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, where
+ * we determine whether an rq/bio is sync or not. There are cases, like during
+ * merging and during request allocation, where we don't have an rq but only a
+ * bio and need to find out whether this bio will be considered sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+ if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+ return 1;
+ return 0;
+}
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 16/18] io-controller: Per cgroup request descriptor support
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (14 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
` (5 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o Currently a request queue has a fixed number of request descriptors for
sync and async requests. Once the request descriptors are consumed, new
processes are put to sleep and they effectively become serialized. Because
sync and async queues are separate, async requests don't impact sync ones,
but if one is looking for fairness between async requests, that is not
achievable if request queue descriptors become the bottleneck.
o Make request descriptors per io group so that if there is lots of IO
going on in one cgroup, it does not impact the IO of other groups.
o This is just one relatively simple way of doing things. This patch will
probably change after feedback. Folks have raised concerns that in a
hierarchical setup, a child's request descriptors should be capped by the
parent's request descriptors. Maybe we need per cgroup per device files
in cgroups where one can specify the upper limit of request descriptors,
and whenever a cgroup is created one needs to assign a request descriptor
limit, making sure the total sum of the children's request descriptors is
not more than that of the parent.
Something like the memory controller, I guess. Anyway, that would be the
next step. For the time being, we have implemented something simpler as
follows.
o This patch implements per cgroup request descriptors. The request pool per
queue is still common, but every group will have its own wait list and its
own count of request descriptors allocated to that group for sync and async
queues. So effectively request_list becomes a per io group property and not
a global request queue feature.
o Currently one can define q->nr_requests to limit the request descriptors
allocated for the queue. Now there is another tunable, q->nr_group_requests,
which controls the request descriptor limit per group. q->nr_requests
supersedes q->nr_group_requests to make sure that if there are lots of
groups present, we don't end up allocating too many request descriptors on
the queue (see the sketch after this description).
o Issues: Currently the notion of congestion is per queue. With per group
request descriptors it is possible that the queue is not congested but the
group the bio will go into is congested.
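In effect, request allocation now passes a two-level admission check, roughly
as in the simplified sketch below. This is not the literal get_request() code;
congestion marking, batching and the starved-list handling are left out:

/*
 * Simplified sketch of the two-level request descriptor check added by
 * this patch.
 */
static int sketch_may_alloc_request(struct request_queue *q,
				    struct request_list *rl, int is_sync)
{
	/* Queue-wide cap: q->nr_requests still bounds the total allocation */
	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
		return 0;

	/* Per-group cap: each io group is limited by q->nr_group_requests */
	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
		return 0;

	return 1;
}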
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/blk-core.c | 216 ++++++++++++++++++++++++++++++++++--------------
block/blk-settings.c | 3 +
block/blk-sysfs.c | 57 ++++++++++---
block/elevator-fq.c | 14 +++
block/elevator-fq.h | 5 +
block/elevator.c | 6 +-
include/linux/blkdev.h | 62 +++++++++++++-
7 files changed, 283 insertions(+), 80 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b19510a..9226cdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
- rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+ struct request_list *rl = blk_get_request_list(q, NULL);
+
+ /*
+ * In the case of group scheduling, the request list is inside the
+ * associated group, and when that group is instantiated, it takes
+ * care of initializing the request list as well.
+ */
+ blk_init_request_list(rl);
+#endif
+ q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);
- if (!rl->rq_pool)
+ if (!q->rq_data.rq_pool)
return -ENOMEM;
return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
return NULL;
}
+ /* init starved waiter wait queue */
+ init_waitqueue_head(&q->rq_data.starved_wait);
+
/*
* if caller didn't supply a lock, they get per-queue locking with
* our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
{
if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
}
static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
if (!rq)
return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
if (priv) {
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
return NULL;
}
rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);
- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}
@@ -719,18 +733,29 @@ static void __freed_request(struct request_queue *q, int sync)
* A request has just been released. Account for it, update the full and
* congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
+ BUG_ON(!rl->count[sync]);
rl->count[sync]--;
+
+ BUG_ON(!q->rq_data.count[sync]);
+ q->rq_data.count[sync]--;
+
if (priv)
- rl->elvpriv--;
+ q->rq_data.elvpriv--;
- __freed_request(q, sync);
+ __freed_request(q, sync, rl);
if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+
+ /* Wake up the starved process on global list, if any */
+ if (unlikely(q->rq_data.starved)) {
+ if (waitqueue_active(&q->rq_data.starved_wait))
+ wake_up(&q->rq_data.starved_wait);
+ q->rq_data.starved--;
+ }
}
/*
@@ -739,10 +764,9 @@ static void freed_request(struct request_queue *q, int sync, int priv)
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;
- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
- /*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
- */
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
+ if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, is_sync);
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for time being.
+ */
+ if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+ blk_set_queue_full(q, is_sync);
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The group's request descriptor list will fill after this
+ * allocation, so set it as full, and mark this process as
+ * "batching".
+ * This process will be allowed to complete a batch of
+ * requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else {
+ if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
+ /*
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
+ */
+ goto out;
}
}
- blk_set_queue_congested(q, is_sync);
}
/*
@@ -783,21 +814,43 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+ if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+ goto out;
+
+ /*
+ * Allocation of request is allowed from queue perspective. Now check
+ * from per group request list
+ */
+
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
goto out;
rl->count[is_sync]++;
rl->starved[is_sync] = 0;
+ q->rq_data.count[is_sync]++;
+
priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
- rl->elvpriv++;
+ q->rq_data.elvpriv++;
if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ if (rq) {
+ /*
+ * TODO. Implement group reference counting and take the
+ * reference to the group to make sure group hence request
+ * list does not go away till rq finishes.
+ */
+ rq->rl = rl;
+ }
+#endif
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
@@ -807,7 +860,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
/*
* in the very unlikely event that allocation failed and no
@@ -817,10 +870,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
- rl->starved[is_sync] = 1;
-
- goto out;
+ if (unlikely(rl->count[is_sync] == 0)) {
+ /*
+ * If there is a request pending in other direction
+ * in same io group, then set the starved flag of
+ * the group request list. Otherwise, we need to
+ * make this process sleep in global starved list
+ * to make sure it will not sleep indefinitely.
+ */
+ if (rl->count[is_sync ^ 1] != 0) {
+ rl->starved[is_sync] = 1;
+ goto out;
+ } else {
+ /*
+ * It indicates to calling function to put
+ * task on global starved list. Not the best
+ * way
+ */
+ return ERR_PTR(-ENOMEM);
+ }
+ }
}
/*
@@ -848,15 +917,29 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
- while (!rq) {
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+ while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;
- prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
- TASK_UNINTERRUPTIBLE);
+ if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+ /*
+ * Task failed allocation and needs to wait and
+ * try again. There are no requests pending from
+ * the io group hence need to sleep on global
+ * wait queue. Most likely the allocation failed
+ * because of memory issues.
+ */
+
+ q->rq_data.starved++;
+ prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+ &wait, TASK_UNINTERRUPTIBLE);
+ } else {
+ prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+ TASK_UNINTERRUPTIBLE);
+ }
trace_block_sleeprq(q, bio, rw_flags & 1);
@@ -876,7 +959,12 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
spin_lock_irq(q->queue_lock);
finish_wait(&rl->wait[is_sync], &wait);
- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ /*
+ * After the sleep, check the rl again in case the cgroup the bio
+ * belonged to is gone and it is now mapped to the root group
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
};
return rq;
@@ -885,6 +973,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
BUG_ON(rw != READ && rw != WRITE);
@@ -892,7 +981,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rq = get_request(q, rw, NULL, gfp_mask, rl);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1075,12 +1164,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);
BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 57af728..8733192 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c942ddc..b60b76e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
+ struct request_list *rl = blk_get_request_list(q, NULL);
unsigned long nr;
int ret = queue_var_store(&nr, page, count);
if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_SYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_SYNC);
- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_ASYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_ASYNC);
- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_SYNC);
wake_up(&rl->wait[BLK_RW_SYNC]);
}
- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_ASYNC);
wake_up(&rl->wait[BLK_RW_ASYNC]);
}
spin_unlock_irq(q->queue_lock);
return ret;
}
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}
+#endif
static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -224,6 +247,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -304,6 +335,9 @@ static struct queue_sysfs_entry queue_fairness_entry = {
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -385,12 +419,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;
blk_sync_queue(q);
- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_data.rq_pool)
+ mempool_destroy(q->rq_data.rq_pool);
if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69eaee4..bd98317 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -954,6 +954,16 @@ struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
struct io_cgroup, css);
}
+struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+ struct io_group *iog;
+
+ iog = io_get_io_group_bio(q, bio, 1);
+ BUG_ON(!iog);
+ return &iog->rl;
+}
+
/*
* Search the bfq_group for bfqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu_read_lock().
@@ -1203,6 +1213,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+ blk_init_request_list(&iog->rl);
+
if (leaf == NULL) {
leaf = iog;
prev = leaf;
@@ -1447,6 +1459,8 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ blk_init_request_list(&iog->rl);
+
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5fc7d48..58543ec 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -239,6 +239,9 @@ struct io_group {
/* Single ioq per group, used for noop, deadline, anticipatory */
struct io_queue *ioq;
+
+ /* request list associated with the group */
+ struct request_list rl;
};
/**
@@ -517,6 +520,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+ struct bio *bio);
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
diff --git a/block/elevator.c b/block/elevator.c
index 3b83b2f..44c9fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -668,7 +668,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_data.elvpriv) {
blk_start_queueing(q);
spin_unlock_irq(q->queue_lock);
msleep(10);
@@ -768,8 +768,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
}
if (unplug_it && blk_queue_plugged(q)) {
- int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
- - q->in_flight;
+ int nrq = q->rq_data.count[BLK_RW_SYNC] +
+ q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
if (nrq >= q->unplug_thresh)
__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c209a0..07aca2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
struct sg_io_hdr;
#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 256 /* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ 64 /* Default maximum */
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group (the root group) being
+ * present. Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ /* Default maximum */
+#endif
struct request;
typedef void (rq_end_io_fn)(struct request *, int);
struct request_list {
/*
- * count[], starved[], and wait[] are indexed by
+ * count[], starved and wait[] are indexed by
* BLK_RW_SYNC/BLK_RW_ASYNC
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
+ int starved;
+ /*
+ * Global list for starved tasks. A task will be queued here if
+ * it could not allocate request descriptor and the associated
+ * group request list does not have any requests pending.
+ */
+ wait_queue_head_t starved_wait;
};
/*
@@ -253,6 +283,7 @@ struct request {
#ifdef CONFIG_GROUP_IOSCHED
/* io group request belongs to */
struct io_group *iog;
+ struct request_list *rl;
#endif /* GROUP_IOSCHED */
#endif /* ELV_FAIR_QUEUING */
};
@@ -342,6 +373,9 @@ struct request_queue
*/
struct request_list rq;
+ /* Contains request pool and other data like starved data */
+ struct request_data rq_data;
+
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
@@ -404,6 +438,8 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ /* Max # of per io group requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -776,6 +812,28 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return io_group_get_request_list(q, bio);
+#else
+ return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+ struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ return rq->rl;
+#else
+ return blk_get_request_list(q, NULL);
+#endif
+}
+
/*
* Temporary export, until SCSI gets fixed up.
*/
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 17/18] io-controller: IO group refcounting support
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (15 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
` (4 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o In the original BFQ patch, once a cgroup is deleted it cleans up the
associated io groups immediately, and any active io queues in those groups
are moved to the root group. This movement of queues is not good from a
fairness perspective, as one can create a cgroup, dump lots of IO, delete
the cgroup and then potentially get a higher share. There are more issues
besides this one, so it was felt that we also need an io group refcounting
mechanism so that io groups can be reclaimed asynchronously (a sketch of
the resulting get/put convention follows below).
o This is a crude patch to implement io group refcounting. This is still
work in progress and Nauman and Divyesh are playing with more ideas.
o I can do basic cgroup creation, deletion and task movement operations
without hitting the crashes that were reported against V1 by Gui. Though I
have not yet verified that io groups are actually being freed. Will do that
next.
o There are a couple of hard-to-hit race conditions I am aware of (e.g. an
RCU lookup while the group might be going away during cgroup deletion).
Will fix those in upcoming versions.
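For review purposes, here is a condensed, illustrative sketch of the
reference ownership convention this patch introduces. It is not part of the
patch itself; all names are taken from the diff below. References on an
io_group are held by the io_cgroup's group_data list, by the elevator's
group_list, by each child group on its parent, and by each ioq on its group.
The last elv_put_iog() frees the group and then drops the reference the
group held on its parent:

	/*
	 * Sketch only -- the real code is in block/elevator-fq.[ch] below.
	 */
	static inline void elv_get_iog(struct io_group *iog)
	{
		atomic_inc(&iog->ref);
	}

	void elv_put_iog(struct io_group *iog)
	{
		struct io_group *parent = NULL;

		/* Drop one reference; free only when the last one goes away. */
		if (!atomic_dec_and_test(&iog->ref))
			return;

		/*
		 * By now the group must be off all active/idle service trees.
		 * Guard against a NULL parent entity (root group case).
		 */
		if (iog->my_entity && iog->my_entity->parent)
			parent = container_of(iog->my_entity->parent,
					      struct io_group, entity);

		io_group_cleanup(iog);		/* sanity checks + kfree() */

		/* Release the child group's reference on its parent group. */
		if (parent)
			elv_put_iog(parent);
	}

So, for example, cgroup deletion drops the io_cgroup reference at the end of
__iocg_destroy() and elevator exit drops the elevator reference in
io_disconnect_groups(); whichever of the two happens last actually frees the
group.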
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/cfq-iosched.c | 16 ++-
block/elevator-fq.c | 441 ++++++++++++++++++++++++++++++++++-----------------
block/elevator-fq.h | 26 ++--
3 files changed, 320 insertions(+), 163 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea71239..cf9d258 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,8 +1308,17 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
if (sync_cfqq != NULL) {
__iog = cfqq_to_io_group(sync_cfqq);
- if (iog != __iog)
- io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+ /*
+ * Drop reference to the sync queue. A new sync queue will
+ * be assigned in the new group upon arrival of a fresh request.
+ * If the old queue still has requests, those requests will be
+ * dispatched over a period of time and the queue will be freed
+ * automatically.
+ */
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
}
spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1422,6 +1431,9 @@ alloc_ioq:
elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = current->pid;
+
+ /* ioq reference on iog */
+ elv_get_iog(iog);
cfq_log_cfqq(cfqd, cfqq, "alloced");
}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bd98317..1dd0bb3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,7 +36,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
int extract);
-void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+void elv_release_ioq(struct io_queue **ioq_ptr);
int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
int force);
@@ -108,6 +108,16 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
{
BUG_ON(sd->next_active != entity);
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ struct io_group *iog = NULL;
+
+ BUG_ON(entity == NULL);
+ if (entity->my_sched_data != NULL)
+ iog = container_of(entity, struct io_group, entity);
+ return iog;
+}
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -124,6 +134,11 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
struct io_entity *entity)
{
}
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+ return NULL;
+}
#endif
/*
@@ -224,7 +239,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
struct io_entity *entity)
{
struct rb_node *next;
- struct io_queue *ioq = io_entity_to_ioq(entity);
BUG_ON(entity->tree != &st->idle);
@@ -239,10 +253,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
}
bfq_extract(&st->idle, entity);
-
- /* Delete queue from idle list */
- if (ioq)
- list_del(&ioq->queue_list);
}
/**
@@ -374,9 +384,12 @@ static void bfq_active_insert(struct io_service_tree *st,
void bfq_get_entity(struct io_entity *entity)
{
struct io_queue *ioq = io_entity_to_ioq(entity);
+ struct io_group *iog = io_entity_to_iog(entity);
if (ioq)
elv_get_ioq(ioq);
+ else
+ elv_get_iog(iog);
}
/**
@@ -436,7 +449,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
{
struct io_entity *first_idle = st->first_idle;
struct io_entity *last_idle = st->last_idle;
- struct io_queue *ioq = io_entity_to_ioq(entity);
if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
st->first_idle = entity;
@@ -444,10 +456,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
st->last_idle = entity;
bfq_insert(&st->idle, entity);
-
- /* Add this queue to idle list */
- if (ioq)
- list_add(&ioq->queue_list, &ioq->efqd->idle_list);
}
/**
@@ -463,14 +471,21 @@ static void bfq_forget_entity(struct io_service_tree *st,
struct io_entity *entity)
{
struct io_queue *ioq = NULL;
+ struct io_group *iog = NULL;
BUG_ON(!entity->on_st);
entity->on_st = 0;
st->wsum -= entity->weight;
+
ioq = io_entity_to_ioq(entity);
- if (!ioq)
+ if (ioq) {
+ elv_put_ioq(ioq);
return;
- elv_put_ioq(ioq);
+ }
+
+ iog = io_entity_to_iog(entity);
+ if (iog)
+ elv_put_iog(iog);
}
/**
@@ -909,21 +924,21 @@ void entity_served(struct io_entity *entity, bfq_service_t served,
/*
* Release all the io group references to its async queues.
*/
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+void io_put_io_group_queues(struct io_group *iog)
{
int i, j;
for (i = 0; i < 2; i++)
for (j = 0; j < IOPRIO_BE_NR; j++)
- elv_release_ioq(e, &iog->async_queue[i][j]);
+ elv_release_ioq(&iog->async_queue[i][j]);
/* Free up async idle queue */
- elv_release_ioq(e, &iog->async_idle_queue);
+ elv_release_ioq(&iog->async_idle_queue);
#ifdef CONFIG_GROUP_IOSCHED
/* Optimization for io schedulers having single ioq */
- if (elv_iosched_single_ioq(e))
- elv_release_ioq(e, &iog->ioq);
+ if (iog->ioq)
+ elv_release_ioq(&iog->ioq);
#endif
}
@@ -1018,6 +1033,9 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
entity = &iog->entity;
entity->parent = parent->my_entity;
entity->sched_data = &parent->sched_data;
+ if (entity->parent)
+ /* Child group reference on parent group */
+ elv_get_iog(parent);
}
/**
@@ -1210,6 +1228,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
if (!iog)
goto cleanup;
+ atomic_set(&iog->ref, 0);
+ iog->deleting = 0;
+
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
@@ -1279,7 +1300,12 @@ void io_group_chain_link(struct request_queue *q, void *key,
rcu_assign_pointer(leaf->key, key);
hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ /* io_cgroup reference on io group */
+ elv_get_iog(leaf);
+
hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+ /* elevator reference on io group */
+ elv_get_iog(leaf);
spin_unlock_irqrestore(&iocg->lock, flags);
@@ -1388,12 +1414,23 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
if (!iocg)
return &io_root_cgroup;
+ /*
+ * If this cgroup io_cgroup is being deleted, map the bio to
+ * root cgroup
+ */
+ if (css_is_removed(&iocg->css))
+ return &io_root_cgroup;
+
return iocg;
}
/*
* Find the io group bio belongs to.
* If "create" is set, io group is created if it is not already present.
+ *
+ * Note: There is a narrow window of race where a group is being freed
+ * by cgroup deletion path and some rq has slipped through in this group.
+ * Fix it.
*/
struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
int create)
@@ -1440,8 +1477,8 @@ void io_free_root_group(struct elevator_queue *e)
spin_lock_irq(&iocg->lock);
hlist_del_rcu(&iog->group_node);
spin_unlock_irq(&iocg->lock);
- io_put_io_group_queues(e, iog);
- kfree(iog);
+ io_put_io_group_queues(iog);
+ elv_put_iog(iog);
}
struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1459,11 +1496,15 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+ atomic_set(&iog->ref, 0);
+
blk_init_request_list(&iog->rl);
iocg = &io_root_cgroup;
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
+ /* elevator reference. */
+ elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
spin_unlock_irq(&iocg->lock);
@@ -1560,105 +1601,109 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
}
/*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+ * check whether a given group has got any active entities on any of the
+ * service tree.
*/
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline int io_group_has_active_entities(struct io_group *iog)
{
- int busy, resume;
- struct io_entity *entity = &ioq->entity;
- struct elv_fq_data *efqd = &e->efqd;
- struct io_service_tree *st = io_entity_service_tree(entity);
+ int i;
+ struct io_service_tree *st;
- busy = elv_ioq_busy(ioq);
- resume = !!ioq->nr_queued;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ if (!RB_EMPTY_ROOT(&st->active))
+ return 1;
+ }
- BUG_ON(resume && !entity->on_st);
- BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+ return 0;
+}
+
+/*
+ * Should be called with both iocg->lock as well as queue lock held (if
+ * group is still connected on elevator list)
+ */
+void __iocg_destroy(struct io_cgroup *iocg, struct io_group *iog,
+ int queue_lock_held)
+{
+ int i;
+ struct io_service_tree *st;
/*
- * We could be moving an queue which is on idle tree of previous group
- * What to do? I guess anyway this queue does not have any requests.
- * just forget the entity and free up from idle tree.
- *
- * This needs cleanup. Hackish.
+ * If we are here then we got the queue lock if group was still on
+ * elevator list. If group had already been disconnected from elevator
+ * list, then we don't need the queue lock.
*/
- if (entity->tree == &st->idle) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
- bfq_put_idle_entity(st, entity);
- }
- if (busy) {
- BUG_ON(atomic_read(&ioq->ref) < 2);
-
- if (!resume)
- elv_del_ioq_busy(e, ioq, 0);
- else
- elv_deactivate_ioq(efqd, ioq, 0);
- }
+ /* Remove io group from cgroup list */
+ hlist_del(&iog->group_node);
/*
- * Here we use a reference to bfqg. We don't need a refcounter
- * as the cgroup reference will not be dropped, so that its
- * destroy() callback will not be invoked.
+ * Mark io group for deletion so that no new entry goes in
+ * idle tree. Any active queue will be removed from active
+ * tree and not put into the idle tree.
*/
- entity->parent = iog->my_entity;
- entity->sched_data = &iog->sched_data;
+ iog->deleting = 1;
- if (busy && resume)
- elv_activate_ioq(ioq, 0);
-}
-EXPORT_SYMBOL(io_ioq_move);
+ /* Flush idle tree. */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
-static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
-{
- struct elevator_queue *eq;
- struct io_entity *entity = iog->my_entity;
- struct io_service_tree *st;
- int i;
+ /*
+ * Drop io group reference on all async queues. This group is
+ * going away so once these queues are empty, free those up
+ * instead of keeping these around in the hope that new IO
+ * will come.
+ *
+ * Note: If this group is disconnected from elevator, elevator
+ * switch must have already done it.
+ */
- eq = container_of(efqd, struct elevator_queue, efqd);
- hlist_del(&iog->elv_data_node);
- __bfq_deactivate_entity(entity, 0);
- io_put_io_group_queues(eq, iog);
+ io_put_io_group_queues(iog);
- for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
- st = iog->sched_data.service_tree + i;
+ if (!io_group_has_active_entities(iog)) {
+ /*
+ * io group does not have any active entities. Because this
+ * group has been decoupled from io_cgroup list and this
+ * cgroup is being deleted, this group should not receive
+ * any new IO. Hence it should be safe to deactivate this
+ * io group and remove from the scheduling tree.
+ */
+ __bfq_deactivate_entity(iog->my_entity, 0);
/*
- * The idle tree may still contain bfq_queues belonging
- * to exited task because they never migrated to a different
- * cgroup from the one being destroyed now. Noone else
- * can access them so it's safe to act without any lock.
+ * Because this io group does not have any active entities,
+ * it should be safe to remove it from elevator list and
+ * drop the elevator reference so that upon dropping the io_cgroup
+ * reference, this io group should be freed and we don't
+ * wait for elevator switch to happen to free the group
+ * up.
*/
- io_flush_idle_tree(st);
+ if (queue_lock_held) {
+ hlist_del(&iog->elv_data_node);
+ rcu_assign_pointer(iog->key, NULL);
+ /*
+ * Drop iog reference taken by elevator
+ * (efqd->group_list)
+ */
+ elv_put_iog(iog);
+ }
- BUG_ON(!RB_EMPTY_ROOT(&st->active));
- BUG_ON(!RB_EMPTY_ROOT(&st->idle));
}
- BUG_ON(iog->sched_data.next_active != NULL);
- BUG_ON(iog->sched_data.active_entity != NULL);
- BUG_ON(entity->tree != NULL);
+ /* Drop iocg reference on io group */
+ elv_put_iog(iog);
}
-/**
- * bfq_destroy_group - destroy @bfqg.
- * @bgrp: the bfqio_cgroup containing @bfqg.
- * @bfqg: the group being destroyed.
- *
- * Destroy @bfqg, making sure that it is not referenced from its parent.
- */
-static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- struct elv_fq_data *efqd = NULL;
- unsigned long uninitialized_var(flags);
-
- /* Remove io group from cgroup list */
- hlist_del(&iog->group_node);
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct hlist_node *n, *tmp;
+ struct io_group *iog;
+ unsigned long flags;
+ int queue_lock_held = 0;
+ struct elv_fq_data *efqd;
/*
* io groups are linked in two lists. One list is maintained
@@ -1677,58 +1722,93 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
* try to free up async queues again or flush the idle tree.
*/
- rcu_read_lock();
- efqd = rcu_dereference(iog->key);
- if (efqd != NULL) {
- spin_lock_irqsave(efqd->queue->queue_lock, flags);
- if (iog->key == efqd)
- __io_destroy_group(efqd, iog);
- spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
- }
- rcu_read_unlock();
-
- /*
- * No need to defer the kfree() to the end of the RCU grace
- * period: we are called from the destroy() callback of our
- * cgroup, so we can be sure that noone is a) still using
- * this cgroup or b) doing lookups in it.
- */
- kfree(iog);
-}
+retry:
+ spin_lock_irqsave(&iocg->lock, flags);
+ hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node) {
+ /* Take the group queue lock */
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd != NULL) {
+ if (spin_trylock_irq(efqd->queue->queue_lock)) {
+ if (iog->key == efqd) {
+ queue_lock_held = 1;
+ rcu_read_unlock();
+ goto locked;
+ }
-void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
- struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
- struct hlist_node *n, *tmp;
- struct io_group *iog;
+ /*
+ * After acquiring the queue lock, we found
+ * iog->key==NULL, that means elevator switch
+ * completed, group is no longer connected on
+ * elevator hence we can proceed safely without
+ * queue lock.
+ */
+ spin_unlock_irq(efqd->queue->queue_lock);
+ } else {
+ /*
+ * Did not get the queue lock while trying.
+ * Backout. Drop iocg->lock and try again
+ */
+ rcu_read_unlock();
+ spin_unlock_irqrestore(&iocg->lock, flags);
+ udelay(100);
+ goto retry;
- /*
- * Since we are destroying the cgroup, there are no more tasks
- * referencing it, and all the RCU grace periods that may have
- * referenced it are ended (as the destruction of the parent
- * cgroup is RCU-safe); bgrp->group_data will not be accessed by
- * anything else and we don't need any synchronization.
- */
- hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
- io_destroy_group(iocg, iog);
+ }
+ }
+ /*
+ * We come here when iog->key==NULL, that means elevator switch
+ * has already taken place and now this group is no more
+ * connected on elevator and one does not have to have a
+ * queue lock to do the cleanup.
+ */
+ rcu_read_unlock();
+locked:
+ __iocg_destroy(iocg, iog, queue_lock_held);
+ if (queue_lock_held) {
+ spin_unlock_irq(efqd->queue->queue_lock);
+ queue_lock_held = 0;
+ }
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);
BUG_ON(!hlist_empty(&iocg->group_data));
kfree(iocg);
}
+/* Should be called with queue lock held */
void io_disconnect_groups(struct elevator_queue *e)
{
struct hlist_node *pos, *n;
struct io_group *iog;
struct elv_fq_data *efqd = &e->efqd;
+ int i;
+ struct io_service_tree *st;
hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
elv_data_node) {
- hlist_del(&iog->elv_data_node);
-
+ /*
+ * At this point of time group should be on idle tree. This
+ * would extract the group from idle tree.
+ */
__bfq_deactivate_entity(iog->my_entity, 0);
+ /* Flush all the idle trees of the group */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ io_flush_idle_tree(st);
+ }
+
+ /*
+ * This has to be here also apart from cgroup cleanup path
+ * and the reason being that if async queue reference of the
+ * group are not dropped, then async ioq as well as associated
+ * queue will not be reclaimed. Apart from that async cfqq
+ * has to be cleaned up before elevator goes away.
+ */
+ io_put_io_group_queues(iog);
+
/*
* Don't remove from the group hash, just set an
* invalid key. No lookups can race with the
@@ -1736,11 +1816,68 @@ void io_disconnect_groups(struct elevator_queue *e)
* implies also that new elements cannot be added
* to the list.
*/
+ hlist_del(&iog->elv_data_node);
rcu_assign_pointer(iog->key, NULL);
- io_put_io_group_queues(e, iog);
+ /* Drop iog reference taken by elevator (efqd->group_list)*/
+ elv_put_iog(iog);
}
}
+/*
+ * This cleanup function does the last bit of work needed to destroy the cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+void io_group_cleanup(struct io_group *iog)
+{
+ struct io_service_tree *st;
+ struct io_entity *entity = iog->my_entity;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+ BUG_ON(st->wsum != 0);
+ }
+
+ BUG_ON(iog->sched_data.next_active != NULL);
+ BUG_ON(iog->sched_data.active_entity != NULL);
+ BUG_ON(entity != NULL && entity->tree != NULL);
+
+ kfree(iog);
+}
+
+/*
+ * Should be called with queue lock held. The only case it can be called
+ * without queue lock held is when elevator has gone away leaving behind
+ * dead io groups which are hanging there to be reclaimed when cgroup is
+ * deleted. In case of cgroup deletion, I think there is only one thread
+ * doing the deletion, and the rest of the threads should have been taken
+ * care of by the cgroup code.
+ */
+void elv_put_iog(struct io_group *iog)
+{
+ struct io_group *parent = NULL;
+
+ BUG_ON(!iog);
+
+ BUG_ON(atomic_read(&iog->ref) <= 0);
+ if (!atomic_dec_and_test(&iog->ref))
+ return;
+
+ BUG_ON(iog->entity.on_st);
+
+ if (iog->my_entity)
+ parent = container_of(iog->my_entity->parent,
+ struct io_group, entity);
+ io_group_cleanup(iog);
+
+ if (parent)
+ elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
struct cgroup_subsys io_subsys = {
.name = "io",
.create = iocg_create,
@@ -1887,6 +2024,8 @@ alloc_ioq:
elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
+ /* ioq reference on iog */
+ elv_get_iog(iog);
}
if (new_sched_q)
@@ -1987,7 +2126,7 @@ EXPORT_SYMBOL(io_get_io_group_bio);
void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd.root_group;
- io_put_io_group_queues(e, iog);
+ io_put_io_group_queues(iog);
kfree(iog);
}
@@ -2437,13 +2576,11 @@ void elv_put_ioq(struct io_queue *ioq)
}
EXPORT_SYMBOL(elv_put_ioq);
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+void elv_release_ioq(struct io_queue **ioq_ptr)
{
- struct io_group *root_group = e->efqd.root_group;
struct io_queue *ioq = *ioq_ptr;
if (ioq != NULL) {
- io_ioq_move(e, ioq, root_group);
/* Drop the reference taken by the io group */
elv_put_ioq(ioq);
*ioq_ptr = NULL;
@@ -2600,9 +2737,19 @@ void elv_activate_ioq(struct io_queue *ioq, int add_front)
void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
int requeue)
{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
if (ioq == efqd->active_queue)
elv_reset_active_ioq(efqd);
+ /*
+ * The io group ioq belongs to is going away. Don't requeue the
+ * ioq on idle tree. Free it.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+ if (iog->deleting == 1)
+ requeue = 0;
+#endif
bfq_deactivate_entity(&ioq->entity, requeue);
}
@@ -3002,15 +3149,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
}
}
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
- struct io_queue *ioq, *n;
- struct elv_fq_data *efqd = &e->efqd;
-
- list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
- elv_deactivate_ioq(efqd, ioq, 0);
-}
-
/*
* Call iosched to let that elevator wants to expire the queue. This gives
* iosched like AS to say no (if it is in the middle of batch changeover or
@@ -3427,7 +3565,6 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
INIT_WORK(&efqd->unplug_work, elv_kick_queue);
- INIT_LIST_HEAD(&efqd->idle_list);
INIT_HLIST_HEAD(&efqd->group_list);
efqd->elv_slice[0] = elv_slice_async;
@@ -3458,9 +3595,19 @@ void elv_exit_fq_data(struct elevator_queue *e)
elv_shutdown_timer_wq(e);
spin_lock_irq(q->queue_lock);
- /* This should drop all the idle tree references of ioq */
- elv_free_idle_ioq_list(e);
- /* This should drop all the io group references of async queues */
+ /*
+ * This should drop all the references of async queues taken by
+ * io group.
+ *
+ * Also it should deactivate the group and extract it from the
+ * idle tree. (group can not be on active tree now after the
+ * elevator has been drained).
+ *
+ * Should flush the idle tree of the group, which in turn will drop
+ * ioq reference taken by active/idle tree.
+ *
+ * Drop the iog reference taken by elevator.
+ */
io_disconnect_groups(e);
spin_unlock_irq(q->queue_lock);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58543ec..42e3777 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,7 +165,6 @@ struct io_queue {
/* Pointer to generic elevator data structure */
struct elv_fq_data *efqd;
- struct list_head queue_list;
pid_t pid;
/* Number of requests queued on this io queue */
@@ -219,6 +218,7 @@ struct io_queue {
* o All the other fields are protected by the @bfqd queue lock.
*/
struct io_group {
+ atomic_t ref;
struct io_entity entity;
struct hlist_node elv_data_node;
struct hlist_node group_node;
@@ -242,6 +242,9 @@ struct io_group {
/* request list associated with the group */
struct request_list rl;
+
+ /* io group is going away */
+ int deleting;
};
/**
@@ -279,9 +282,6 @@ struct elv_fq_data {
/* List of io groups hanging on this elevator */
struct hlist_head group_list;
- /* List of io queues on idle tree. */
- struct list_head idle_list;
-
struct request_queue *queue;
unsigned int busy_queues;
/*
@@ -504,8 +504,6 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
#ifdef CONFIG_GROUP_IOSCHED
extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog);
extern void elv_fq_set_request_io_group(struct request_queue *q,
struct request *rq, struct bio *bio);
static inline bfq_weight_t iog_weight(struct io_group *iog)
@@ -523,6 +521,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
extern struct request_list *io_group_get_request_list(struct request_queue *q,
struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
/* Returns single ioq associated with the io group. */
static inline struct io_queue *io_group_ioq(struct io_group *iog)
{
@@ -545,17 +545,12 @@ static inline struct io_group *rq_iog(struct request_queue *q,
return rq->iog;
}
-#else /* !GROUP_IOSCHED */
-/*
- * No ioq movement is needed in case of flat setup. root io group gets cleaned
- * up upon elevator exit and before that it has been made sure that both
- * active and idle tree are empty.
- */
-static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
- struct io_group *iog)
+static inline void elv_get_iog(struct io_group *iog)
{
+ atomic_inc(&iog->ref);
}
+#else /* !GROUP_IOSCHED */
static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
@@ -608,6 +603,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
return NULL;
}
+static inline void elv_get_iog(struct io_group *iog) { }
+
+static inline void elv_put_iog(struct io_group *iog) { }
extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (16 preceding siblings ...)
2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
2009-05-05 20:24 ` Andrew Morton
` (3 subsequent siblings)
21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
fer
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
o Little debugging aid for hierarchical IO scheduling.
o Enabled under CONFIG_DEBUG_GROUP_IOSCHED
o Currently it outputs more debug messages in the blktrace output, which
helps a great deal when debugging a hierarchical setup. A sketch of the
per-site logging pattern is included below.
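Each tracing site added by the diff below repeats the same three steps under
CONFIG_DEBUG_GROUP_IOSCHED: map the queue to its group, resolve the group's
cgroup path, and emit a message through elv_log_ioq(). A hedged sketch of
that pattern, wrapped in a hypothetical helper (the helper itself is not
part of the patch; io_group_path(), ioq_to_io_group() and elv_log_ioq() are
the names this patchset uses):

	#ifdef CONFIG_DEBUG_GROUP_IOSCHED
	/* Hypothetical wrapper around the pattern open-coded at each site below. */
	static inline void elv_log_ioq_group(struct elv_fq_data *efqd,
					     struct io_queue *ioq,
					     const char *event)
	{
		char path[128];
		struct io_group *iog = ioq_to_io_group(ioq);

		/* Resolve the cgroup path of the group this queue belongs to. */
		io_group_path(iog, path, sizeof(path));
		elv_log_ioq(efqd, ioq, "%s: grp=%s QTt=0x%lx QTs=0x%lx rq_queued=%d",
			    event, path, ioq->entity.total_service,
			    ioq->entity.total_sector_service, ioq->nr_queued);
	}
	#else
	static inline void elv_log_ioq_group(struct elv_fq_data *efqd,
					     struct io_queue *ioq,
					     const char *event)
	{
	}
	#endif

A call such as elv_log_ioq_group(efqd, ioq, "add to busy") would then show up
in the blktrace stream with the cgroup path and per-queue service totals,
which is the same kind of message the open-coded blocks in this patch
produce.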
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/Kconfig.iosched | 10 +++-
block/elevator-fq.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 6 ++
3 files changed, 141 insertions(+), 6 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
request, original owner of the bio is decided by using io tracking
patches otherwise we continue to attribute the request to the
submitting thread.
-endmenu
+config DEBUG_GROUP_IOSCHED
+ bool "Debug Hierarchical Scheduling support"
+ depends on CGROUPS && GROUP_IOSCHED
+ default n
+ ---help---
+ Enable some debugging hooks for hierarchical scheduling support.
+ Currently it just outputs more information in blktrace output.
+
+endmenu
endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1dd0bb3..9500619 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -30,7 +30,7 @@ static int elv_rate_sampling_window = HZ / 10;
#define IO_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
#define IO_SERVICE_TREE_INIT ((struct io_service_tree) \
- { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+ { RB_ROOT, RB_ROOT, 0, NULL, NULL, 0, 0 })
static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
struct io_queue *ioq, int probe);
@@ -118,6 +118,37 @@ static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
iog = container_of(entity, struct io_group, entity);
return iog;
}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+ unsigned short id = iog->iocg_id;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+
+ if (!id)
+ goto out;
+
+ css = css_lookup(&io_subsys, id);
+ if (!css)
+ goto out;
+
+ if (!css_tryget(css))
+ goto out;
+
+ cgroup_path(css->cgroup, buf, buflen);
+
+ css_put(css);
+
+ rcu_read_unlock();
+ return;
+out:
+ rcu_read_unlock();
+ buf[0] = '\0';
+ return;
+}
+#endif
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -372,7 +403,7 @@ static void bfq_active_insert(struct io_service_tree *st,
struct rb_node *node = &entity->rb_node;
bfq_insert(&st->active, entity);
-
+ st->nr_active++;
if (node->rb_left != NULL)
node = node->rb_left;
else if (node->rb_right != NULL)
@@ -434,7 +465,7 @@ static void bfq_active_extract(struct io_service_tree *st,
node = bfq_find_deepest(&entity->rb_node);
bfq_extract(&st->active, entity);
-
+ st->nr_active--;
if (node != NULL)
bfq_update_active_tree(node);
}
@@ -1233,6 +1264,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
blk_init_request_list(&iog->rl);
@@ -1506,6 +1540,9 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
/* elevator reference. */
elv_get_iog(iog);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ iog->iocg_id = css_id(&iocg->css);
+#endif
spin_unlock_irq(&iocg->lock);
return iog;
@@ -1886,6 +1923,7 @@ struct cgroup_subsys io_subsys = {
.destroy = iocg_destroy,
.populate = iocg_populate,
.subsys_id = io_subsys_id,
+ .use_id = 1,
};
/*
@@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
{
entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ struct elv_fq_data *efqd = ioq->efqd;
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+ " QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d",
+ served, ioq->nr_sectors,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
/* Tells whether ioq is queued in root group or not */
@@ -2671,11 +2728,34 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
if (ioq) {
struct io_group *iog = ioq_to_io_group(ioq);
+
elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
- " weight=%ld group_weight=%ld",
+ " weight=%ld rq_queued=%d group_weight=%ld",
efqd->busy_queues,
ioq->entity.ioprio, ioq->entity.weight,
- iog_weight(iog));
+ ioq->nr_queued, iog_weight(iog));
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_service_tree *grpst;
+ int nr_active = 0;
+ if (iog != efqd->root_group) {
+ grpst = io_entity_service_tree(
+ &iog->entity);
+ nr_active = grpst->nr_active;
+ }
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "set_active, ioq grp=%s"
+ " nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+ " GTs=0x%lx rq_queued=%d", path, nr_active,
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
ioq->slice_end = 0;
elv_clear_ioq_wait_request(ioq);
@@ -2764,6 +2844,22 @@ void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
efqd->busy_queues++;
if (elv_ioq_class_rt(ioq))
efqd->busy_rt_queues++;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log(efqd, "add to busy: QTt=0x%lx QTs=0x%lx "
+ "ioq grp=%s GTt=0x%lx GTs=0x%lx rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#endif
}
void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -2773,7 +2869,24 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
BUG_ON(!elv_ioq_busy(ioq));
BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ struct io_group *iog = ioq_to_io_group(ioq);
+ io_group_path(iog, path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+ "QTs=0x%lx ioq grp=%s GTt=0x%lx GTs=0x%lx "
+ "rq_queued=%d",
+ ioq->entity.total_service,
+ ioq->entity.total_sector_service,
+ path,
+ iog->entity.total_service,
+ iog->entity.total_sector_service,
+ ioq->nr_queued);
+ }
+#else
elv_log_ioq(efqd, ioq, "del from busy");
+#endif
elv_clear_ioq_busy(ioq);
BUG_ON(efqd->busy_queues == 0);
efqd->busy_queues--;
@@ -3000,6 +3113,14 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
elv_ioq_update_io_thinktime(ioq);
elv_ioq_update_idle_window(q->elevator, ioq, rq);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ char path[128];
+ io_group_path(rq_iog(q, rq), path, sizeof(path));
+ elv_log_ioq(efqd, ioq, "add rq: group path=%s "
+ "rq_queued=%d", path, ioq->nr_queued);
+ }
+#endif
if (ioq == elv_active_ioq(q->elevator)) {
/*
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 42e3777..db3a347 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,8 @@ struct io_service_tree {
struct rb_root active;
struct rb_root idle;
+ int nr_active;
+
struct io_entity *first_idle;
struct io_entity *last_idle;
@@ -245,6 +247,10 @@ struct io_group {
/* io group is going away */
int deleting;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ unsigned short iocg_id;
+#endif
};
/**
--
1.6.0.1
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
@ 2009-05-05 20:24 ` Andrew Morton
2009-05-05 19:58 ` Vivek Goyal
` (36 subsequent siblings)
37 siblings, 0 replies; 297+ messages in thread
From: Andrew Morton @ 2009-05-05 20:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Tue, 5 May 2009 15:58:27 -0400
Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> Hi All,
>
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> ...
> Currently primarily two other IO controller proposals are out there.
>
> dm-ioband
> ---------
> This patch set is from Ryo Tsuruta from valinux.
> ...
> IO-throttling
> -------------
> This patch set is from Andrea Righi provides max bandwidth controller.
I'm thinking we need to lock you guys in a room and come back in 15 minutes.
Seriously, how are we to resolve this? We could lock me in a room and
come back in 15 days, but there's no reason to believe that I'd emerge
with the best answer.
I tend to think that a cgroup-based controller is the way to go.
Anything else will need to be wired up to cgroups _anyway_, and that
might end up messy.
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
@ 2009-05-05 20:24 ` Andrew Morton
0 siblings, 0 replies; 297+ messages in thread
From: Andrew Morton @ 2009-05-05 20:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, vgoyal
On Tue, 5 May 2009 15:58:27 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Hi All,
>
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> ...
> Currently primarily two other IO controller proposals are out there.
>
> dm-ioband
> ---------
> This patch set is from Ryo Tsuruta from valinux.
> ...
> IO-throttling
> -------------
> This patch set is from Andrea Righi provides max bandwidth controller.
I'm thinking we need to lock you guys in a room and come back in 15 minutes.
Seriously, how are we to resolve this? We could lock me in a room and
come back in 15 days, but there's no reason to believe that I'd emerge
with the best answer.
I tend to think that a cgroup-based controller is the way to go.
Anything else will need to be wired up to cgroups _anyway_, and that
might end up messy.
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 20:24 ` Andrew Morton
(?)
@ 2009-05-05 22:20 ` Peter Zijlstra
2009-05-06 3:42 ` Balbir Singh
2009-05-06 3:42 ` Balbir Singh
-1 siblings, 2 replies; 297+ messages in thread
From: Peter Zijlstra @ 2009-05-05 22:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Vivek Goyal, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
righi.andrea, agk, dm-devel, snitzer, m-ikeda
On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> cmoe back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
> I tend to think that a cgroup-based controller is the way to go.
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.
FWIW I subscribe to the io-scheduler faith as opposed to the
device-mapper cult ;-)
Also, I don't think a simple throttle will be very useful, a more mature
solution should cater to more use cases.
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 22:20 ` Peter Zijlstra
@ 2009-05-06 3:42 ` Balbir Singh
2009-05-06 3:42 ` Balbir Singh
1 sibling, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 3:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
* Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> FWIW I subscribe to the io-scheduler faith as opposed to the
> device-mapper cult ;-)
>
> Also, I don't think a simple throttle will be very useful, a more mature
> solution should cater to more use cases.
>
I tend to agree, unless Andrea can prove us wrong. I don't think
throttling a task (not letting it consume CPU, memory when its IO
quota is exceeded) is a good idea. I've asked that question to Andrea
a few times, but got no response.
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 22:20 ` Peter Zijlstra
2009-05-06 3:42 ` Balbir Singh
@ 2009-05-06 3:42 ` Balbir Singh
2009-05-06 10:20 ` Fabio Checconi
` (3 more replies)
1 sibling, 4 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 3:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Vivek Goyal, nauman, dpshah, lizf, mikew,
fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
taka, guijianfeng, jmoyer, dhaval, linux-kernel, containers,
righi.andrea, agk, dm-devel, snitzer, m-ikeda
* Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> FWIW I subscribe to the io-scheduler faith as opposed to the
> device-mapper cult ;-)
>
> Also, I don't think a simple throttle will be very useful, a more mature
> solution should cater to more use cases.
>
I tend to agree, unless Andrea can prove us wrong. I don't think
throttling a task (not letting it consume CPU, memory when its IO
quota is exceeded) is a good idea. I've asked that question to Andrea
a few times, but got no response.
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 3:42 ` Balbir Singh
@ 2009-05-06 10:20 ` Fabio Checconi
2009-05-06 17:10 ` Balbir Singh
[not found] ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2009-05-06 18:47 ` Divyesh Shah
` (2 subsequent siblings)
3 siblings, 2 replies; 297+ messages in thread
From: Fabio Checconi @ 2009-05-06 10:20 UTC (permalink / raw)
To: Balbir Singh
Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, dpshah, lizf,
mikew, paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, linux-kernel, containers,
righi.andrea, agk, dm-devel, snitzer, m-ikeda
Hi,
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> Date: Wed, May 06, 2009 09:12:54AM +0530
>
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
>
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue, 5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > >
> > > > Hi All,
> > > >
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > >
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > >
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > >
> > > Seriously, how are we to resolve this? We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > >
> > > I tend to think that a cgroup-based controller is the way to go.
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> >
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> >
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
>
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
>
from what I can see, the principle used by io-throttling is not too
different to what happens when bandwidth differentiation with synchronous
access patterns is achieved using idling at the io scheduler level.
When an io scheduler anticipates requests from a task/cgroup, all the
other tasks with pending (synchronous) requests are in fact blocked, and
the fact that the task being anticipated is allowed to submit additional
io while they remain blocked is what creates the bandwidth differentiation
among them.
Of course there are many differences, in particular related to the
latencies introduced by the two mechanisms, the granularity they use to
allocate disk service, and to what throttling and proportional share io
scheduling can or cannot guarantee, but FWIK both of them rely on
blocking tasks to create bandwidth differentiation.
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 10:20 ` Fabio Checconi
@ 2009-05-06 17:10 ` Balbir Singh
[not found] ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 17:10 UTC (permalink / raw)
To: Fabio Checconi
Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, paolo.valente,
fernando, jmoyer, righi.andrea, containers, linux-kernel,
Andrew Morton
* Fabio Checconi <fchecconi@gmail.com> [2009-05-06 12:20:30]:
> Hi,
>
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Date: Wed, May 06, 2009 09:12:54AM +0530
> >
> > * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> >
> > > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > > On Tue, 5 May 2009 15:58:27 -0400
> > > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > >
> > > > > Hi All,
> > > > >
> > > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > > ...
> > > > > Currently primarily two other IO controller proposals are out there.
> > > > >
> > > > > dm-ioband
> > > > > ---------
> > > > > This patch set is from Ryo Tsuruta from valinux.
> > > > > ...
> > > > > IO-throttling
> > > > > -------------
> > > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > >
> > > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > >
> > > > Seriously, how are we to resolve this? We could lock me in a room and
> > > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > > with the best answer.
> > > >
> > > > I tend to think that a cgroup-based controller is the way to go.
> > > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > > might end up messy.
> > >
> > > FWIW I subscribe to the io-scheduler faith as opposed to the
> > > device-mapper cult ;-)
> > >
> > > Also, I don't think a simple throttle will be very useful, a more mature
> > > solution should cater to more use cases.
> > >
> >
> > I tend to agree, unless Andrea can prove us wrong. I don't think
> > throttling a task (not letting it consume CPU, memory when its IO
> > quota is exceeded) is a good idea. I've asked that question to Andrea
> > a few times, but got no response.
> >
>
> from what I can see, the principle used by io-throttling is not too
> different to what happens when bandwidth differentiation with synchronous
> access patterns is achieved using idling at the io scheduler level.
>
> When an io scheduler anticipates requests from a task/cgroup, all the
> other tasks with pending (synchronous) requests are in fact blocked, and
> the fact that the task being anticipated is allowed to submit additional
> io while they remain blocked is what creates the bandwidth differentiation
> among them.
>
> Of course there are many differences, in particular related to the
> latencies introduced by the two mechanisms, the granularity they use to
> allocate disk service, and to what throttling and proportional share io
> scheduling can or cannot guarantee, but FWIK both of them rely on
> blocking tasks to create bandwidth differentiation.
My concern stems from the fact that in this case we might
throttle all the tasks in the group.. no? I'll take a closer look.
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
@ 2009-05-06 17:10 ` Balbir Singh
0 siblings, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 17:10 UTC (permalink / raw)
To: Fabio Checconi
Cc: paolo.valente, dhaval, snitzer, fernando, jmoyer, linux-kernel,
dm-devel, jens.axboe, Andrew Morton, containers, agk,
righi.andrea
* Fabio Checconi <fchecconi@gmail.com> [2009-05-06 12:20:30]:
> Hi,
>
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Date: Wed, May 06, 2009 09:12:54AM +0530
> >
> > * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> >
> > > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > > On Tue, 5 May 2009 15:58:27 -0400
> > > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > >
> > > > > Hi All,
> > > > >
> > > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > > ...
> > > > > Currently primarily two other IO controller proposals are out there.
> > > > >
> > > > > dm-ioband
> > > > > ---------
> > > > > This patch set is from Ryo Tsuruta from valinux.
> > > > > ...
> > > > > IO-throttling
> > > > > -------------
> > > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > >
> > > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > >
> > > > Seriously, how are we to resolve this? We could lock me in a room and
> > > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > > with the best answer.
> > > >
> > > > I tend to think that a cgroup-based controller is the way to go.
> > > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > > might end up messy.
> > >
> > > FWIW I subscribe to the io-scheduler faith as opposed to the
> > > device-mapper cult ;-)
> > >
> > > Also, I don't think a simple throttle will be very useful, a more mature
> > > solution should cater to more use cases.
> > >
> >
> > I tend to agree, unless Andrea can prove us wrong. I don't think
> > throttling a task (not letting it consume CPU, memory when its IO
> > quota is exceeded) is a good idea. I've asked that question to Andrea
> > a few times, but got no response.
> >
>
> from what I can see, the principle used by io-throttling is not too
> different to what happens when bandwidth differentiation with synchronous
> access patterns is achieved using idling at the io scheduler level.
>
> When an io scheduler anticipates requests from a task/cgroup, all the
> other tasks with pending (synchronous) requests are in fact blocked, and
> the fact that the task being anticipated is allowed to submit additional
> io while they remain blocked is what creates the bandwidth differentiation
> among them.
>
> Of course there are many differences, in particular related to the
> latencies introduced by the two mechanisms, the granularity they use to
> allocate disk service, and to what throttling and proportional share io
> scheduling can or cannot guarantee, but FWIK both of them rely on
> blocking tasks to create bandwidth differentiation.
My concern stems from the fact that in this case we might
throttle all the tasks in the group.. no? I'll take a closer look.
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 3:42 ` Balbir Singh
2009-05-06 10:20 ` Fabio Checconi
@ 2009-05-06 18:47 ` Divyesh Shah
[not found] ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2009-05-06 20:42 ` Andrea Righi
3 siblings, 0 replies; 297+ messages in thread
From: Divyesh Shah @ 2009-05-06 18:47 UTC (permalink / raw)
To: balbir
Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, lizf, mikew,
fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
taka, guijianfeng, jmoyer, dhaval, linux-kernel, containers,
righi.andrea, agk, dm-devel, snitzer, m-ikeda
Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
>
>> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
>>> On Tue, 5 May 2009 15:58:27 -0400
>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>>>> ...
>>>> Currently primarily two other IO controller proposals are out there.
>>>>
>>>> dm-ioband
>>>> ---------
>>>> This patch set is from Ryo Tsuruta from valinux.
>>>> ...
>>>> IO-throttling
>>>> -------------
>>>> This patch set is from Andrea Righi provides max bandwidth controller.
>>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>>
>>> Seriously, how are we to resolve this? We could lock me in a room and
>>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>>> with the best answer.
>>>
>>> I tend to think that a cgroup-based controller is the way to go.
>>> Anything else will need to be wired up to cgroups _anyway_, and that
>>> might end up messy.
>> FWIW I subscribe to the io-scheduler faith as opposed to the
>> device-mapper cult ;-)
>>
>> Also, I don't think a simple throttle will be very useful, a more mature
>> solution should cater to more use cases.
>>
>
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
I agree with what Balbir said about the effects of throttling on the
memory and CPU usage of that task.
Nauman and I have been working on Vivek's set of patches (which also
includes some patches by Nauman) and have been testing and developing on
top of that. I've found this solution to be the one that takes us closest
to a complete solution. This approach works well under the assumption
that the queues are backlogged, and in the limited testing that we've done
so far it doesn't fare that badly when they are not backlogged (though
there is definitely room to improve there).
With buffered writes, when the queues are not backlogged, I think it might
be useful to explore the VM space and see if we can do something there
without any impact on the task's memory or CPU usage. I don't have any
brilliant ideas on this now, but I want to get people thinking about it.
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 10:20 ` Fabio Checconi
2009-05-06 18:47 ` Divyesh Shah
2009-05-06 20:42 ` Andrea Righi
2 siblings, 0 replies; 297+ messages in thread
From: Fabio Checconi @ 2009-05-06 10:20 UTC (permalink / raw)
To: Balbir Singh
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
Hi,
> From: Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Date: Wed, May 06, 2009 09:12:54AM +0530
>
> * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
>
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue, 5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > >
> > > >
> > > > Hi All,
> > > >
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > >
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > >
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > >
> > > Seriously, how are we to resolve this? We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > >
> > > I tend to think that a cgroup-based controller is the way to go.
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> >
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> >
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
>
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
>
From what I can see, the principle used by io-throttling is not too
different from what happens when bandwidth differentiation with synchronous
access patterns is achieved using idling at the io scheduler level.
When an io scheduler anticipates requests from a task/cgroup, all the
other tasks with pending (synchronous) requests are in fact blocked, and
the fact that the task being anticipated is allowed to submit additional
io while they remain blocked is what creates the bandwidth differentiation
among them.
Of course there are many differences, in particular related to the
latencies introduced by the two mechanisms, the granularity they use to
allocate disk service, and to what throttling and proportional share io
scheduling can or cannot guarantee, but AFAIK both of them rely on
blocking tasks to create bandwidth differentiation.
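As a rough illustration of that point, here is a minimal userspace sketch
(purely hypothetical numbers, not scheduler code): the anticipated queue is
granted several back-to-back dispatches, with the device idling between
them, while the other queue's pending synchronous request simply waits, and
that waiting is what produces the bandwidth split.

#include <stdio.h>

/*
 * Toy model (hypothetical): the disk serves one 4KB request per tick.
 * Queue A is anticipated: the scheduler dispatches (or idles for) A for
 * 'batch' consecutive ticks before letting queue B's pending request
 * through, so A ends up with 'batch' times the bandwidth of B even
 * though both queues always have work to do.
 */
int main(void)
{
    const int batch = 4;            /* ticks granted to A per round */
    long served_a = 0, served_b = 0;

    for (int tick = 0; tick < 1000; tick++) {
        if (tick % (batch + 1) < batch)
            served_a += 4096;       /* A dispatches, B stays blocked */
        else
            served_b += 4096;       /* B finally gets one request in */
    }
    printf("A: %ld KB, B: %ld KB\n", served_a / 1024, served_b / 1024);
    return 0;
}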
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 3:42 ` Balbir Singh
` (2 preceding siblings ...)
[not found] ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 20:42 ` Andrea Righi
3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:42 UTC (permalink / raw)
To: Balbir Singh
Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, dpshah, lizf,
mikew, fchecconi, paolo.valente, jens.axboe, ryov, fernando,
s-uchida, taka, guijianfeng, jmoyer, dhaval, linux-kernel,
containers, agk, dm-devel, snitzer, m-ikeda
On Wed, May 06, 2009 at 09:12:54AM +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
>
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue, 5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > >
> > > > Hi All,
> > > >
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > >
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > >
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > >
> > > Seriously, how are we to resolve this? We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > >
> > > I tend to think that a cgroup-based controller is the way to go.
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> >
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> >
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
>
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
Sorry Balbir, I probably missed your question. Or replied in a different
thread maybe...
Actually we could allow an offending cgroup to continue to submit IO
requests without throttling it directly. But if we don't want to waste
the memory with pending IO requests or pending writeback pages, we need
to block it sooner or later.
Instead of directly throttling the offending applications, we could block
them when we hit a max limit of requests or dirty pages, i.e. something
like congestion_wait(), but that's the same, no? The difference is that
in this case throttling is asynchronous. Or am I oversimplifying it?
As an example, with writeback IO io-throttle doesn't throttle the IO
requests directly; each request instead receives a deadline (depending
on the BW limit) and is added to an rbtree. Then all the requests are
dispatched asynchronously by a kernel thread (kiothrottled) only once
the deadline has expired.
OK, there's a lot of room for improvement: provide several kernel threads
per block device, multiple queues/rbtrees, etc., but this is actually a
way to apply throttling asynchronously. The fact is that if I don't
also apply the throttling in balance_dirty_pages() (and I did so in the
last io-throttle version) or add a max limit of requests, the rbtree
grows indefinitely...
That should be very similar to the proportional BW solution allocating a
quota of nr_requests per block device and per cgroup.
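To make the mechanism above concrete, here is a minimal userspace sketch of
the deadline idea (hypothetical names, with a sorted list standing in for
the rbtree): each request gets a dispatch deadline derived from the group's
bandwidth limit, and a dispatcher releases it only once that deadline has
passed, so the submitter itself is never blocked directly.

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch of deadline-based throttling (hypothetical, userspace-only):
 * every request of 'len' bytes gets a deadline based on the group's
 * bandwidth limit and is queued in deadline order (an rbtree in the
 * real design, a sorted list here).  A dispatcher releases a request
 * only when its deadline has passed.
 */
struct req {
    long deadline_ms;
    long len;
    struct req *next;
};

static struct req *queue;

static void queue_request(long now_ms, long len, long bw_bytes_per_ms)
{
    struct req *r = malloc(sizeof(*r)), **p = &queue;
    long start = now_ms;

    /* back-to-back requests accumulate delay: start after the last one */
    for (struct req *q = queue; q; q = q->next)
        if (q->deadline_ms > start)
            start = q->deadline_ms;
    r->deadline_ms = start + len / bw_bytes_per_ms;
    r->len = len;

    while (*p && (*p)->deadline_ms <= r->deadline_ms)
        p = &(*p)->next;
    r->next = *p;
    *p = r;
}

static void dispatch_expired(long now_ms)
{
    while (queue && queue->deadline_ms <= now_ms) {
        struct req *r = queue;

        queue = r->next;
        printf("t=%ldms: dispatch %ld bytes (deadline %ldms)\n",
               now_ms, r->len, r->deadline_ms);
        free(r);
    }
}

int main(void)
{
    /* group limited to 100 bytes/ms; three 4KB writes queued at t=0 */
    for (int i = 0; i < 3; i++)
        queue_request(0, 4096, 100);
    for (long t = 0; t <= 130; t += 10)
        dispatch_expired(t);
    return 0;
}

In the real patches the sorted structure would be a per-device rbtree and
the dispatcher the kiothrottled kernel thread, rather than a loop in main().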
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 20:24 ` Andrew Morton
(?)
(?)
@ 2009-05-06 2:33 ` Vivek Goyal
2009-05-06 17:59 ` Nauman Rafique
` (4 more replies)
-1 siblings, 5 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 2:33 UTC (permalink / raw)
To: Andrew Morton
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, peterz
On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> cmoe back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
> I tend to think that a cgroup-based controller is the way to go.
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.
Hi Andrew,
Sorry, I did not get what you mean by a cgroup-based controller. If you
mean that we use cgroups for grouping tasks for controlling IO, then both
the IO scheduler based controller and the io-throttling proposal do that.
dm-ioband also supports that to some extent, but it requires the extra step
of transferring cgroup grouping information to the dm-ioband device using
dm-tools.
But if you meant the io-throttle patch set, then I think it solves only
part of the problem, namely max bw control. It does not offer minimum
BW/minimum disk share guarantees as offered by proportional BW control.
IOW, it supports upper limit control and does not support a work conserving
IO controller which lets a group use the whole BW if competing groups are
not present. IMHO, proportional BW control is an important feature which
we will need and IIUC, io-throttle patches can't be easily extended to support
proportional BW control, OTOH, one should be able to extend IO scheduler
based proportional weight controller to also support max bw control.
Andrea, last time you were planning to have a look at my patches and see
if max bw controller can be implemented there. I got a feeling that it
should not be too difficult to implement it there. We already have the
hierarchical tree of io queues and groups in elevator layer and we run
BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
just a matter of also keeping track of IO rate per queue/group and we should
easily be able to delay the dispatch of IO from a queue if its group has
crossed the specified max bw.
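A rough sketch of that bookkeeping (hypothetical names, not part of the
posted patches): track the bytes a group has dispatched in the current time
window and hold back dispatch from that group once the window's budget,
derived from the configured max bw, is exhausted.

#include <stdio.h>

/*
 * Hypothetical per-group max-BW check in the dispatch path: the group
 * accumulates the bytes it has dispatched in the current 100ms window;
 * once the window's byte budget is spent, dispatch from this group is
 * delayed until the window rolls over.
 */
struct io_group {
    long max_bw;            /* bytes per second, 0 = unlimited */
    long window_start_ms;
    long window_bytes;
};

static int over_max_bw(struct io_group *grp, long now_ms, long req_bytes)
{
    long budget;

    if (!grp->max_bw)
        return 0;
    if (now_ms - grp->window_start_ms >= 100) {     /* new window */
        grp->window_start_ms = now_ms;
        grp->window_bytes = 0;
    }
    budget = grp->max_bw / 10;                      /* 100ms worth */
    if (grp->window_bytes + req_bytes > budget)
        return 1;                                   /* delay dispatch */
    grp->window_bytes += req_bytes;
    return 0;
}

int main(void)
{
    struct io_group grp = { .max_bw = 1024 * 1024 };    /* 1 MB/s */
    long now = 0;

    for (int i = 0; i < 40; i++, now += 10) {
        if (over_max_bw(&grp, now, 65536))
            printf("t=%ldms: group over limit, delay dispatch\n", now);
        else
            printf("t=%ldms: dispatch 64KB\n", now);
    }
    return 0;
}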
This should lead to less code and reduced complexity (compared with the
case where we do max bw control with io-throttling patches and proportional
BW control using IO scheduler based control patches).
So do you think that it would make sense to do max BW control along with
proportional weight IO controller at IO scheduler? If yes, then we can
work together and continue to develop this patchset to also support max
bw control and meet your requirements and drop the io-throttling patches.
The only thing which concerns me is the fact that IO scheduler does not
have the view of higher level logical device. So if somebody has setup a
software RAID and wants to put max BW limit on software raid device, this
solution will not work. One shall have to live with max bw limits on
individual disks (where io scheduler is actually running). Do your patches
allow to put limit on software RAID devices also?
Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
of FIFO dispatch of buffered bios. Apart from that it tries to provide
fairness in terms of actual IO done, and that would mean a seeky workload
can use the disk for much longer to get equivalent IO done and slow down
other applications. Implementing the IO controller at the IO scheduler level
gives us tighter control. Will it not meet your requirements? If you have
specific concerns with the IO scheduler based control patches, please highlight these and
we will see how these can be addressed.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 2:33 ` Vivek Goyal
@ 2009-05-06 17:59 ` Nauman Rafique
2009-05-06 20:07 ` Andrea Righi
` (3 subsequent siblings)
4 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-06 17:59 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, peterz
On Tue, May 5, 2009 at 7:33 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
>> On Tue, 5 May 2009 15:58:27 -0400
>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>> >
>> > Hi All,
>> >
>> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>> > ...
>> > Currently primarily two other IO controller proposals are out there.
>> >
>> > dm-ioband
>> > ---------
>> > This patch set is from Ryo Tsuruta from valinux.
>> > ...
>> > IO-throttling
>> > -------------
>> > This patch set is from Andrea Righi provides max bandwidth controller.
>>
>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>
>> Seriously, how are we to resolve this? We could lock me in a room and
>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>> with the best answer.
>>
>> I tend to think that a cgroup-based controller is the way to go.
>> Anything else will need to be wired up to cgroups _anyway_, and that
>> might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
In my opinion, IO throttling and dm-ioband are probably simpler, but
incomplete solutions to the problem. And for a solution to be
complete, it would have to be at the IO scheduler layer so it can do
things like taking an IO as soon as it comes and sticking it at the front
of all the queues so that it can go to the disk right away. This patch
set is big, but it takes us in the right direction. Our ultimate goal
should be to reach the level of control that we have over CPU and
network resources, and I don't think the IO throttling and dm-ioband
approaches take us in that direction.
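As a toy illustration of that kind of preferential dispatch (hypothetical
code, not from the patch set): only a controller sitting at the dispatch
level can take a newly arrived request from a latency-sensitive group and
place it at the head of the dispatch list.

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy sketch (hypothetical): requests from a "boosted" group jump to the
 * head of the dispatch list as soon as they arrive, something only the
 * IO scheduler/dispatch layer is in a position to do.
 */
struct request {
    int id;
    int boosted;
    struct request *next;
};

static struct request *dispatch_head;

static void add_request(int id, int boosted)
{
    struct request *r = malloc(sizeof(*r));

    r->id = id;
    r->boosted = boosted;
    if (boosted) {                  /* go straight to the front */
        r->next = dispatch_head;
        dispatch_head = r;
    } else {                        /* append at the tail */
        struct request **p = &dispatch_head;

        while (*p)
            p = &(*p)->next;
        r->next = NULL;
        *p = r;
    }
}

int main(void)
{
    add_request(1, 0);
    add_request(2, 0);
    add_request(3, 1);      /* arrives last but dispatches first */
    for (struct request *r = dispatch_head; r; r = r->next)
        printf("dispatch request %d%s\n", r->id,
               r->boosted ? " (boosted)" : "");
    return 0;
}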
>
> Thanks
> Vivek
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 2:33 ` Vivek Goyal
2009-05-06 17:59 ` Nauman Rafique
@ 2009-05-06 20:07 ` Andrea Righi
2009-05-06 21:21 ` Vivek Goyal
2009-05-06 21:21 ` Vivek Goyal
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 subsequent siblings)
4 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:07 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
Well, IMHO the big concern is at which level we want to implement the
logic of control: at the IO scheduler, when the IO requests are already
submitted and need to be dispatched, or at a higher level, when the
applications generate IO requests (or maybe both).
And, as pointed out by Andrew, do everything via a cgroup-based controller.
The other features (proportional BW, throttling, taking the current ioprio
model into account, etc.) are implementation details, and any of the
proposed solutions can be extended to support all these features. I
mean, io-throttle can be extended to support proportional BW (from a
certain perspective it is already provided by the throttling water mark
in v16), just as the IO scheduler based controller can be extended to
support absolute BW limits. The same goes for dm-ioband. I don't think
there are huge obstacles to merging the functionalities in this sense.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
Yes, sorry for my late reply. I quickly tested your patchset, but I still need
to understand many details of your solution. In the next days I'll
re-read everything carefully and I'll try to do a detailed review of
your patchset (just re-building the kernel with your patchset applied).
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
mmmh... changing the logic at the elevator and all IO schedulers doesn't
sound like reduced complexity and less code changed. With io-throttle we
just need to place the cgroup_io_throttle() hook in the right functions
where we want to apply throttling. This is quite an easy approach for
extending the IO control also to logical devices (more generally, devices
that use their own make_request_fn) or even network-attached devices, as
well as network filesystems, etc.
But I may be wrong. As I said I still need to review in the details your
solution.
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
It is surely worth exploring. Honestly, I don't know if it would be
a better solution or not. Probably comparing some results with different
IO workloads is the best way to proceed and decide which is the right
way to go. This is necessary IMHO, before totally dropping one solution
or another.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
No, but as said above my patchset provides the interfaces to apply the
IO control and accounting wherever we want. At the moment there's just
one interface, cgroup_io_throttle().
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 20:07 ` Andrea Righi
2009-05-06 21:21 ` Vivek Goyal
@ 2009-05-06 21:21 ` Vivek Goyal
[not found] ` <20090506212121.GI8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 21:21 UTC (permalink / raw)
To: Andrea Righi
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Wed, May 06, 2009 at 10:07:53PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> > On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > > On Tue, 5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > >
> > > > Hi All,
> > > >
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > >
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > >
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > >
> > > Seriously, how are we to resolve this? We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > >
> > > I tend to think that a cgroup-based controller is the way to go.
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> >
> > Hi Andrew,
> >
> > Sorry, did not get what do you mean by cgroup based controller? If you
> > mean that we use cgroups for grouping tasks for controlling IO, then both
> > IO scheduler based controller as well as io throttling proposal do that.
> > dm-ioband also supports that up to some extent but it requires extra step of
> > transferring cgroup grouping information to dm-ioband device using dm-tools.
> >
> > But if you meant that io-throttle patches, then I think it solves only
> > part of the problem and that is max bw control. It does not offer minimum
> > BW/minimum disk share gurantees as offered by proportional BW control.
> >
> > IOW, it supports upper limit control and does not support a work conserving
> > IO controller which lets a group use the whole BW if competing groups are
> > not present. IMHO, proportional BW control is an important feature which
> > we will need and IIUC, io-throttle patches can't be easily extended to support
> > proportional BW control, OTOH, one should be able to extend IO scheduler
> > based proportional weight controller to also support max bw control.
>
> Well, IMHO the big concern is at which level we want to implement the
> logic of control: IO scheduler, when the IO requests are already
> submitted and need to be dispatched, or at high level when the
> applications generates IO requests (or maybe both).
>
> And, as pointed by Andrew, do everything by a cgroup-based controller.
I am not sure what the rationale behind that is. Why do it at a higher
layer? Doing it at the IO scheduler layer will make sure that one does not
break the IO scheduler's properties within a cgroup. (See my other mail
with some io-throttling test results.)
The advantage of a higher layer mechanism is that it can also cover software
RAID devices well.
>
> The other features, proportional BW, throttling, take the current ioprio
> model in account, etc. are implementation details and any of the
> proposed solutions can be extended to support all these features. I
> mean, io-throttle can be extended to support proportional BW (for a
> certain perspective it is already provided by the throttling water mark
> in v16), as well as the IO scheduler based controller can be extended to
> support absolute BW limits. The same for dm-ioband. I don't think
> there're huge obstacle to merge the functionalities in this sense.
Yes, from a technical point of view, one can implement a proportional BW
controller at a higher layer too. But that would practically mean almost
re-implementing the CFQ logic at a higher layer. Now why get into all
that complexity? Why not simply make CFQ hierarchical to also handle the
groups?
Secondly, think of the following odd scenarios if we implement a higher level
proportional BW controller which can offer the same features as CFQ and
can also handle group scheduling.
Case1:
======
(Higher level proportional BW controller)
/dev/sda (CFQ)
So if somebody wants group scheduling, we will be doing the same IO control
at two places (within the group): once at the higher level and a second time
at the CFQ level. That does not sound too logical to me.
Case2:
======
(Higher level proportional BW controller)
/dev/sda (NOOP)
This is the other extreme. The lower level IO scheduler does not offer any
notion of class or prio within a class, and the higher level scheduler will
still be maintaining all the infrastructure unnecessarily.
That's why I get back to this simple question again: why not extend the
IO schedulers to handle group scheduling and do both proportional BW and
max bw control there.
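To be concrete about what that extension amounts to, here is a simplified
weighted fair queueing sketch (illustrative only, not the actual BFQ/WF2Q+
code in the patches): each group keeps a virtual time that advances by
service divided by weight, and the scheduler always dispatches from the
backlogged group with the smallest virtual time, which yields service in
proportion to the weights.

#include <stdio.h>

/*
 * Simplified weighted fair queueing sketch (not the real WF2Q+/BFQ
 * code): each backlogged group carries a virtual time that advances by
 * service/weight after every dispatch; always serving the group with
 * the smallest virtual time yields service proportional to the weights.
 */
struct group {
    const char *name;
    long weight;
    double vtime;
    long served;
};

int main(void)
{
    struct group g[2] = {
        { "A", 200, 0.0, 0 },
        { "B", 100, 0.0, 0 },
    };
    const long slice = 4096;        /* bytes served per dispatch */

    for (int i = 0; i < 300; i++) {
        struct group *min = &g[0];

        for (int j = 1; j < 2; j++)
            if (g[j].vtime < min->vtime)
                min = &g[j];
        min->served += slice;
        min->vtime += (double)slice / min->weight;
    }
    for (int j = 0; j < 2; j++)
        printf("group %s (weight %ld): %ld KB\n",
               g[j].name, g[j].weight, g[j].served / 1024);
    return 0;
}

In the hierarchical case the same selection would simply be applied at each
level of the group tree, which is roughly what making CFQ hierarchical would
amount to.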
>
> >
> > Andrea, last time you were planning to have a look at my patches and see
> > if max bw controller can be implemented there. I got a feeling that it
> > should not be too difficult to implement it there. We already have the
> > hierarchical tree of io queues and groups in elevator layer and we run
> > BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> > just a matter of also keeping track of IO rate per queue/group and we should
> > be easily be able to delay the dispatch of IO from a queue if its group has
> > crossed the specified max bw.
>
> Yes, sorry for my late, I quickly tested your patchset, but I still need
> to understand many details of your solution. In the next days I'll
> re-read everything carefully and I'll try to do a detailed review of
> your patchset (just re-building the kernel with your patchset applied).
>
Sure. My patchset is still in the infancy stage. So don't expect great
results. But it does highlight the idea and design very well.
> >
> > This should lead to less code and reduced complextiy (compared with the
> > case where we do max bw control with io-throttling patches and proportional
> > BW control using IO scheduler based control patches).
>
> mmmh... changing the logic at the elevator and all IO schedulers doesn't
> sound like reduced complexity and less code changed. With io-throttle we
> just need to place the cgroup_io_throttle() hook in the right functions
> where we want to apply throttling. This is a quite easy approach to
> extend the IO control also to logical devices (more in general devices
> that use their own make_request_fn) or even network-attached devices, as
> well as networking filesystems, etc.
>
> But I may be wrong. As I said I still need to review in the details your
> solution.
Well, I meant reduced code in the sense that we implement both max bw and
proportional bw at the IO scheduler level instead of proportional BW at the
IO scheduler and max bw at a higher level.
I agree that doing max bw control at a higher level has the advantage that
it covers all kinds of devices (higher level logical devices) and the IO
scheduler level solution does not. But this comes at the price
of broken IO scheduler properties within a cgroup.
Maybe we can then implement both: a higher level max bw controller and a
max bw feature implemented alongside the proportional BW controller at the
IO scheduler level. Folks who use hardware RAID or single disk devices can
use the max bw control of the IO scheduler, and those using software RAID
devices can use the higher level max bw controller.
>
> >
> > So do you think that it would make sense to do max BW control along with
> > proportional weight IO controller at IO scheduler? If yes, then we can
> > work together and continue to develop this patchset to also support max
> > bw control and meet your requirements and drop the io-throttling patches.
>
> It is surely worth to be explored. Honestly, I don't know if it would be
> a better solution or not. Probably comparing some results with different
> IO workloads is the best way to proceed and decide which is the right
> way to go. This is necessary IMHO, before totally dropping one solution
> or another.
Sure. My patches have started giving some basic results, but there
is a lot of work remaining before a fair comparison can be done on the
basis of performance under various workloads. So there is some more time
to go before we can do a fair comparison based on numbers.
>
> >
> > The only thing which concerns me is the fact that IO scheduler does not
> > have the view of higher level logical device. So if somebody has setup a
> > software RAID and wants to put max BW limit on software raid device, this
> > solution will not work. One shall have to live with max bw limits on
> > individual disks (where io scheduler is actually running). Do your patches
> > allow to put limit on software RAID devices also?
>
> No, but as said above my patchset provides the interfaces to apply the
> IO control and accounting wherever we want. At the moment there's just
> one interface, cgroup_io_throttle().
Sorry, I did not get it clearly. I guess I did not ask the question right.
So let's say I have a setup where there are two physical devices, /dev/sda
and /dev/sdb, and I create a logical device (say using device mapper
facilities)
on top of these two physical disks. And some application is generating
the IO for logical device lv0.
            Appl
              |
             lv0
            /   \
          sda   sdb
Where should I put the bandwidth limiting rules now for io-throttle? Do I
specify these for the lv0 device or for the sda and sdb devices?
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 17:59 ` Nauman Rafique
2009-05-06 20:07 ` Andrea Righi
` (2 subsequent siblings)
3 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-06 17:59 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Tue, May 5, 2009 at 7:33 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
>> On Tue, 5 May 2009 15:58:27 -0400
>> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> >
>> > Hi All,
>> >
>> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>> > ...
>> > Currently primarily two other IO controller proposals are out there.
>> >
>> > dm-ioband
>> > ---------
>> > This patch set is from Ryo Tsuruta from valinux.
>> > ...
>> > IO-throttling
>> > -------------
>> > This patch set is from Andrea Righi provides max bandwidth controller.
>>
>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>
>> Seriously, how are we to resolve this? We could lock me in a room and
>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>> with the best answer.
>>
>> I tend to think that a cgroup-based controller is the way to go.
>> Anything else will need to be wired up to cgroups _anyway_, and that
>> might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
In my opinion, IO throttling and dm-ioband are probably simpler, but
incomplete, solutions to the problem. For a solution to be complete, it would
have to be at the IO scheduler layer so it can do things like take an IO as
soon as it arrives and stick it at the front of all the queues so that it can
go to the disk right away. This patch set is big, but it takes us in the right
direction. Our ultimate goal should be to reach the level of control that we
have over CPU and network resources. And I don't think the IO throttling and
dm-ioband approaches take us in that direction.
>
> Thanks
> Vivek
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 17:59 ` Nauman Rafique
@ 2009-05-06 20:07 ` Andrea Righi
2009-05-06 20:32 ` Vivek Goyal
2009-05-07 0:18 ` Ryo Tsuruta
3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:07 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
Well, IMHO the big concern is at which level we want to implement the control
logic: at the IO scheduler, when the IO requests have already been submitted
and need to be dispatched, or at a higher level, when the applications generate
the IO requests (or maybe both).
And, as pointed out by Andrew, do everything via a cgroup-based controller.
The other features (proportional BW, throttling, taking the current ioprio
model into account, etc.) are implementation details, and any of the proposed
solutions can be extended to support all of them. I mean, io-throttle can be
extended to support proportional BW (from a certain perspective it is already
provided by the throttling water mark in v16), just as the IO scheduler based
controller can be extended to support absolute BW limits. The same goes for
dm-ioband. I don't think there are huge obstacles to merging the
functionalities in this sense.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
Yes, sorry for my delay. I quickly tested your patchset, but I still need
to understand many details of your solution. In the next few days I'll
re-read everything carefully and try to do a detailed review of your patchset
(I am just re-building the kernel with your patchset applied).
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
mmmh... changing the logic at the elevator and in all the IO schedulers
doesn't sound like reduced complexity and less code changed. With io-throttle
we just need to place the cgroup_io_throttle() hook in the right functions
where we want to apply throttling. This is quite an easy approach for extending
the IO control also to logical devices (more generally, devices that use their
own make_request_fn), or even to network-attached devices, as well as network
filesystems, etc.
But I may be wrong. As I said, I still need to review your solution in detail.
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
It is surely worth exploring. Honestly, I don't know whether or not it would be
a better solution. Probably comparing results under different IO workloads is
the best way to proceed and decide which is the right way to go. This is
necessary, IMHO, before totally dropping one solution or another.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
No, but as said above my patchset provides the interfaces to apply the
IO control and accounting wherever we want. At the moment there's just
one interface, cgroup_io_throttle().
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 17:59 ` Nauman Rafique
2009-05-06 20:07 ` Andrea Righi
@ 2009-05-06 20:32 ` Vivek Goyal
2009-05-07 0:18 ` Ryo Tsuruta
3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 20:32 UTC (permalink / raw)
To: Andrew Morton, Andrea Righi
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
Hi Andrea and others,
I have always had this doubt in mind that any kind of 2nd level controller will
have no idea about the underlying IO scheduler queues/semantics. So while it
can implement a particular cgroup policy (max bw like io-throttle or
proportional bw like dm-ioband), there are high chances that it will break the
IO scheduler's semantics in one way or another.
I had already sent out the results for dm-ioband in a separate thread.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
Here are some basic results with io-throttle. Andrea, please let me know
if you think this is a procedural problem; I am playing with the io-throttle
patches for the first time.
I took V16 of your patches and am trying them out on 2.6.30-rc4 with the CFQ
scheduler.
I have got one SATA drive with one partition on it.
I am trying to create one cgroup, assign an 8MB/s limit to it, and launch
one RT prio 0 task and one BE prio 7 task, to see how this 8MB/s is divided
between these tasks. Following are the results.
Following is my test script.
*******************************************************************
#!/bin/bash
mount /dev/sdb1 /mnt/sdb
mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
# Set bw limit of 8 MB/s on sdb
echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max
sync
echo 3 > /proc/sys/vm/drop_caches
echo $$ > /cgroup/iot/test1/tasks
# Launch a normal prio reader.
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
pid1=$!
echo $pid1
# Launch an RT reader
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2
wait $pid2
echo "RT task finished"
**********************************************************************
Test1
=====
Test two readers (one RT class and one BE class) and see how BW is
allocated with-in cgroup
With io-throttle patches
------------------------
- Two readers, first BE prio 7, second RT prio 0
234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
RT task finished
Note: See, there is no difference in the performance of RT or BE task.
Looks like these got throttled equally.
Without io-throttle patches
----------------------------
- Two readers, first BE prio 7, second RT prio 0
234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
Note: Because I can't limit the BW without the io-throttle patches, don't worry
about the increased BW. The important point is that the RT task gets much more
BW than the BE prio 7 task.
Test2
====
- Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
distributed among these.
With io-throttle patches
------------------------
- Two readers, first BE prio 7, second BE prio 0
234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
High prio reader finished
Without io-throttle patches
---------------------------
- Two readers, first BE prio 7, second BE prio 0
234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
High prio reader finished
234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
Note: There is no service differentiation between prio 0 and prio 7 task
with io-throttle patches.
Test 3
======
- Run one RT reader and one BE reader in the root cgroup without any
limitations. I guess this should mean unlimited BW, and behavior should be the
same as with CFQ without the io-throttling patches.
With io-throttle patches
=========================
Ran the test 4 times because I was getting different results in different
runs.
- Two readers, one RT prio 0 other BE prio 7
234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
RT task finished
234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
RT task finished
Note: Out of 4 runs, it looks like twice there was complete priority inversion
and the RT task finished after the BE task. In the other two runs, the
difference between the BW of the RT and BE tasks is much smaller compared to
the runs without the patches. In fact, once it was almost the same.
Without io-throttle patches.
===========================
- Two readers, one RT prio 0 other BE prio 7 (4 runs)
234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s
234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s
234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s
234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s
Note how consistent the behavior is without the io-throttle patches.
In summary, I think a 2nd level solution can enforce one policy on cgroups, but
it will break other semantics/properties of the IO scheduler within a cgroup,
as a 2nd level solution has no idea at run time which IO scheduler is running
underneath and what kind of properties it has.
Andrea, please try it on your setup and see whether or not you get similar
results. Hopefully it is not a configuration or test procedure issue on my
side.
Thanks
Vivek
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
>
> Thanks
> Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2009-05-06 20:32 ` Vivek Goyal
@ 2009-05-07 0:18 ` Ryo Tsuruta
3 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07 0:18 UTC (permalink / raw)
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
I'd like to avoid complicating the existing IO schedulers and other kernel
code, and to give users a choice of whether or not to use it.
I know that you chose an approach of using compile time options to get the
same behavior as the old system, but device-mapper drivers can be added,
removed and replaced while the system is running.
Thanks,
Ryo Tsuruta
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 2:33 ` Vivek Goyal
` (2 preceding siblings ...)
[not found] ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 20:32 ` Vivek Goyal
[not found] ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 21:34 ` Andrea Righi
2009-05-07 0:18 ` Ryo Tsuruta
4 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 20:32 UTC (permalink / raw)
To: Andrew Morton, Andrea Righi
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
m-ikeda, peterz
On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> > I tend to think that a cgroup-based controller is the way to go.
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
Hi Andrea and others,
I have always had this doubt in mind that any kind of 2nd level controller will
have no idea about the underlying IO scheduler queues/semantics. So while it
can implement a particular cgroup policy (max bw like io-throttle or
proportional bw like dm-ioband), there are high chances that it will break the
IO scheduler's semantics in one way or another.
I had already sent out the results for dm-ioband in a separate thread.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
Here are some basic results with io-throttle. Andrea, please let me know
if you think this is a procedural problem; I am playing with the io-throttle
patches for the first time.
I took V16 of your patches and am trying them out on 2.6.30-rc4 with the CFQ
scheduler.
I have got one SATA drive with one partition on it.
I am trying to create one cgroup, assign an 8MB/s limit to it, and launch
one RT prio 0 task and one BE prio 7 task, to see how this 8MB/s is divided
between these tasks. Following are the results.
Following is my test script.
*******************************************************************
#!/bin/bash
mount /dev/sdb1 /mnt/sdb
mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
# Set bw limit of 8 MB/s on sdb
echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max
sync
echo 3 > /proc/sys/vm/drop_caches
echo $$ > /cgroup/iot/test1/tasks
# Launch a normal prio reader.
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
pid1=$!
echo $pid1
# Launch an RT reader
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2
wait $pid2
echo "RT task finished"
**********************************************************************
Test1
=====
Test two readers (one RT class and one BE class) and see how BW is
allocated with-in cgroup
With io-throttle patches
------------------------
- Two readers, first BE prio 7, second RT prio 0
234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
RT task finished
Note: See, there is no difference in the performance of RT or BE task.
Looks like these got throttled equally.
Without io-throttle patches
----------------------------
- Two readers, first BE prio 7, second RT prio 0
234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
Note: Because I can't limit the BW without the io-throttle patches, don't worry
about the increased BW. The important point is that the RT task gets much more
BW than the BE prio 7 task.
Test2
====
- Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
distributed among these.
With io-throttle patches
------------------------
- Two readers, first BE prio 7, second BE prio 0
234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
High prio reader finished
Without io-throttle patches
---------------------------
- Two readers, first BE prio 7, second BE prio 0
234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
High prio reader finished
234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
Note: There is no service differentiation between prio 0 and prio 7 task
with io-throttle patches.
Test 3
======
- Run one RT reader and one BE reader in the root cgroup without any
limitations. I guess this should mean unlimited BW, and behavior should be the
same as with CFQ without the io-throttling patches.
With io-throttle patches
=========================
Ran the test 4 times because I was getting different results in different
runs.
- Two readers, one RT prio 0 other BE prio 7
234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
RT task finished
234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
RT task finished
Note: Out of 4 runs, it looks like twice there was complete priority inversion
and the RT task finished after the BE task. In the other two runs, the
difference between the BW of the RT and BE tasks is much smaller compared to
the runs without the patches. In fact, once it was almost the same.
Without io-throttle patches.
===========================
- Two readers, one RT prio 0 other BE prio 7 (4 runs)
234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s
234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s
234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s
234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s
Note how consistent the behavior is without the io-throttle patches.
In summary, I think a 2nd level solution can enforce one policy on cgroups, but
it will break other semantics/properties of the IO scheduler within a cgroup,
as a 2nd level solution has no idea at run time which IO scheduler is running
underneath and what kind of properties it has.
Andrea, please try it on your setup and see whether or not you get similar
results. Hopefully it is not a configuration or test procedure issue on my
side.
Thanks
Vivek
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
>
> Thanks
> Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 21:34 ` Andrea Righi
0 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 21:34 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> Hi Andrea and others,
>
> I always had this doubt in mind that any kind of 2nd level controller will
> have no idea about underlying IO scheduler queues/semantics. So while it
> can implement a particular cgroup policy (max bw like io-throttle or
> proportional bw like dm-ioband) but there are high chances that it will
> break IO scheduler's semantics in one way or other.
>
> I had already sent out the results for dm-ioband in a separate thread.
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
>
> Here are some basic results with io-throttle. Andrea, please let me know
> if you think this is procedural problem. Playing with io-throttle patches
> for the first time.
>
> I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> scheduler.
>
> I have got one SATA drive with one partition on it.
>
> I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> between these tasks. Following are the results.
>
> Following is my test script.
>
> *******************************************************************
> #!/bin/bash
>
> mount /dev/sdb1 /mnt/sdb
>
> mount -t cgroup -o blockio blockio /cgroup/iot/
> mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
>
> # Set bw limit of 8 MB/ps on sdb
> echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> /cgroup/iot/test1/blockio.bandwidth-max
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/iot/test1/tasks
>
> # Launch a normal prio reader.
> ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> pid1=$!
> echo $pid1
>
> # Launch an RT reader
> ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> pid2=$!
> echo $pid2
>
> wait $pid2
> echo "RT task finished"
> **********************************************************************
>
> Test1
> =====
> Test two readers (one RT class and one BE class) and see how BW is
> allocated with-in cgroup
>
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second RT prio 0
>
> 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> RT task finished
>
> Note: See, there is no difference in the performance of RT or BE task.
> Looks like these got throttled equally.
OK, this is consistent with the current io-throttle implementation. IO
requests are throttled without any concept of the ioprio model.
We could try to distribute the throttling as a function of each task's
ioprio, but OK, the obvious drawback is that it totally breaks the logic
used by the underlying layers.
BTW, I'm wondering, is it a very critical issue? I would say: why not move
the RT task to a different cgroup with unlimited BW? Or limited BW, but with
other tasks running at the same IO priority... could the cgroup subsystem be
a more flexible and customizable framework with respect to the current ioprio
model?
I'm not saying we have to ignore the problem, just trying to evaluate
the impact and alternatives. And I'm still convinced that also providing
per-cgroup ioprio would be an important feature.
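For example, a minimal sketch of what I mean, reusing your test script (the
cgroup names and the split here are only illustrative, not a recommendation):

# Hypothetical sketch: keep the throttled BE reader in test1 and run the
# RT reader from an unthrottled cgroup instead.
mkdir -p /cgroup/iot/rt-unlimited   # no blockio.bandwidth-max rule set here
echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max

# The BE prio 7 reader stays in the limited group.
echo $$ > /cgroup/iot/test1/tasks
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &

# The RT prio 0 reader runs from the unlimited group (children inherit the
# parent's cgroup at fork time).
echo $$ > /cgroup/iot/rt-unlimited/tasks
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &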
>
>
> Without io-throttle patches
> ----------------------------
> - Two readers, first BE prio 7, second RT prio 0
>
> 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
>
> Note: Because I can't limit the BW without io-throttle patches, so don't
> worry about increased BW. But the important point is that RT task
> gets much more BW than a BE prio 7 task.
>
> Test2
> ====
> - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> distributed among these.
>
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second BE prio 0
>
> 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> High prio reader finished
Ditto.
>
> Without io-throttle patches
> ---------------------------
> - Two readers, first BE prio 7, second BE prio 0
>
> 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> High prio reader finished
> 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
>
> Note: There is no service differentiation between prio 0 and prio 7 task
> with io-throttle patches.
>
> Test 3
> ======
> - Run the one RT reader and one BE reader in root cgroup without any
> limitations. I guess this should mean unlimited BW and behavior should
> be same as with CFQ without io-throttling patches.
>
> With io-throttle patches
> =========================
> Ran the test 4 times because I was getting different results in different
> runs.
>
> - Two readers, one RT prio 0 other BE prio 7
>
> 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> RT task finished
>
> 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
>
> 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
>
> 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> RT task finished
>
> Note: Out of 4 runs, looks like twice it is complete priority inversion
> and RT task finished after BE task. Rest of the two times, the
> difference between BW of RT and BE task is much less as compared to
> without patches. In fact once it was almost same.
This is strange. If you don't set any limit there shouldn't be any
difference with respect to the other case (without the io-throttle patches).
At worst a small overhead given by task_to_iothrottle(), under
rcu_read_lock(). I'll repeat this test ASAP and see if I'm able to
reproduce this strange behaviour.
>
> Without io-throttle patches.
> ===========================
> - Two readers, one RT prio 0 other BE prio 7 (4 runs)
>
> 234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s
>
> 234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s
>
> 234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s
>
> 234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s
>
> Note, How consistent the behavior is without io-throttle patches.
>
> In summary, I think a 2nd level solution can ensure one policy on cgroups but
> it will break other semantics/properties of IO scheduler with-in cgroup as
> 2nd level solution has no idea at run time what is the IO scheduler running
> underneath and what kind of properties it has.
>
> Andrea, please try it on your setup and see if you get similar results
> on or. Hopefully it is not a configuration or test procedure issue on my
> side.
>
> Thanks
> Vivek
>
> > The only thing which concerns me is the fact that IO scheduler does not
> > have the view of higher level logical device. So if somebody has setup a
> > software RAID and wants to put max BW limit on software raid device, this
> > solution will not work. One shall have to live with max bw limits on
> > individual disks (where io scheduler is actually running). Do your patches
> > allow to put limit on software RAID devices also?
> >
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
> >
> > Thanks
> > Vivek
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 20:32 ` Vivek Goyal
[not found] ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 21:34 ` Andrea Righi
2009-05-06 21:52 ` Vivek Goyal
1 sibling, 1 reply; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 21:34 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> Hi Andrea and others,
>
> I always had this doubt in mind that any kind of 2nd level controller will
> have no idea about underlying IO scheduler queues/semantics. So while it
> can implement a particular cgroup policy (max bw like io-throttle or
> proportional bw like dm-ioband) but there are high chances that it will
> break IO scheduler's semantics in one way or other.
>
> I had already sent out the results for dm-ioband in a separate thread.
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
>
> Here are some basic results with io-throttle. Andrea, please let me know
> if you think this is procedural problem. Playing with io-throttle patches
> for the first time.
>
> I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> scheduler.
>
> I have got one SATA drive with one partition on it.
>
> I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> between these tasks. Following are the results.
>
> Following is my test script.
>
> *******************************************************************
> #!/bin/bash
>
> mount /dev/sdb1 /mnt/sdb
>
> mount -t cgroup -o blockio blockio /cgroup/iot/
> mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
>
> # Set bw limit of 8 MB/ps on sdb
> echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> /cgroup/iot/test1/blockio.bandwidth-max
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/iot/test1/tasks
>
> # Launch a normal prio reader.
> ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> pid1=$!
> echo $pid1
>
> # Launch an RT reader
> ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> pid2=$!
> echo $pid2
>
> wait $pid2
> echo "RT task finished"
> **********************************************************************
>
> Test1
> =====
> Test two readers (one RT class and one BE class) and see how BW is
> allocated with-in cgroup
>
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second RT prio 0
>
> 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> RT task finished
>
> Note: See, there is no difference in the performance of RT or BE task.
> Looks like these got throttled equally.
OK, this is consistent with the current io-throttle implementation. IO
requests are throttled without any concept of the ioprio model.
We could try to distribute the throttling as a function of each task's
ioprio, but OK, the obvious drawback is that it totally breaks the logic
used by the underlying layers.
BTW, I'm wondering, is it a very critical issue? I would say: why not move
the RT task to a different cgroup with unlimited BW? Or limited BW, but with
other tasks running at the same IO priority... could the cgroup subsystem be
a more flexible and customizable framework with respect to the current ioprio
model?
I'm not saying we have to ignore the problem, just trying to evaluate
the impact and alternatives. And I'm still convinced that also providing
per-cgroup ioprio would be an important feature.
>
>
> Without io-throttle patches
> ----------------------------
> - Two readers, first BE prio 7, second RT prio 0
>
> 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
>
> Note: Because I can't limit the BW without io-throttle patches, so don't
> worry about increased BW. But the important point is that RT task
> gets much more BW than a BE prio 7 task.
>
> Test2
> ====
> - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> distributed among these.
>
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second BE prio 0
>
> 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> High prio reader finished
Ditto.
>
> Without io-throttle patches
> ---------------------------
> - Two readers, first BE prio 7, second BE prio 0
>
> 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> High prio reader finished
> 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
>
> Note: There is no service differentiation between prio 0 and prio 7 task
> with io-throttle patches.
>
> Test 3
> ======
> - Run the one RT reader and one BE reader in root cgroup without any
> limitations. I guess this should mean unlimited BW and behavior should
> be same as with CFQ without io-throttling patches.
>
> With io-throttle patches
> =========================
> Ran the test 4 times because I was getting different results in different
> runs.
>
> - Two readers, one RT prio 0 other BE prio 7
>
> 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> RT task finished
>
> 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
>
> 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
>
> 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> RT task finished
>
> Note: Out of 4 runs, looks like twice it is complete priority inversion
> and RT task finished after BE task. Rest of the two times, the
> difference between BW of RT and BE task is much less as compared to
> without patches. In fact once it was almost same.
This is strange. If you don't set any limit there shouldn't be any
difference with respect to the other case (without the io-throttle patches).
At worst a small overhead given by task_to_iothrottle(), under
rcu_read_lock(). I'll repeat this test ASAP and see if I'm able to
reproduce this strange behaviour.
>
> Without io-throttle patches.
> ===========================
> - Two readers, one RT prio 0 other BE prio 7 (4 runs)
>
> 234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s
>
> 234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s
>
> 234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s
>
> 234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s
>
> Note, How consistent the behavior is without io-throttle patches.
>
> In summary, I think a 2nd level solution can ensure one policy on cgroups but
> it will break other semantics/properties of IO scheduler with-in cgroup as
> 2nd level solution has no idea at run time what is the IO scheduler running
> underneath and what kind of properties it has.
>
> Andrea, please try it on your setup and see if you get similar results
> on or. Hopefully it is not a configuration or test procedure issue on my
> side.
>
> Thanks
> Vivek
>
> > The only thing which concerns me is the fact that IO scheduler does not
> > have the view of higher level logical device. So if somebody has setup a
> > software RAID and wants to put max BW limit on software raid device, this
> > solution will not work. One shall have to live with max bw limits on
> > individual disks (where io scheduler is actually running). Do your patches
> > allow to put limit on software RAID devices also?
> >
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
> >
> > Thanks
> > Vivek
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 21:34 ` Andrea Righi
@ 2009-05-06 21:52 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 21:52 UTC (permalink / raw)
To: Andrea Righi
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > Hi Andrea and others,
> >
> > I always had this doubt in mind that any kind of 2nd level controller will
> > have no idea about underlying IO scheduler queues/semantics. So while it
> > can implement a particular cgroup policy (max bw like io-throttle or
> > proportional bw like dm-ioband) but there are high chances that it will
> > break IO scheduler's semantics in one way or other.
> >
> > I had already sent out the results for dm-ioband in a separate thread.
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> >
> > Here are some basic results with io-throttle. Andrea, please let me know
> > if you think this is procedural problem. Playing with io-throttle patches
> > for the first time.
> >
> > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > scheduler.
> >
> > I have got one SATA drive with one partition on it.
> >
> > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > between these tasks. Following are the results.
> >
> > Following is my test script.
> >
> > *******************************************************************
> > #!/bin/bash
> >
> > mount /dev/sdb1 /mnt/sdb
> >
> > mount -t cgroup -o blockio blockio /cgroup/iot/
> > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> >
> > # Set bw limit of 8 MB/ps on sdb
> > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > /cgroup/iot/test1/blockio.bandwidth-max
> >
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > echo $$ > /cgroup/iot/test1/tasks
> >
> > # Launch a normal prio reader.
> > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > pid1=$!
> > echo $pid1
> >
> > # Launch an RT reader
> > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > pid2=$!
> > echo $pid2
> >
> > wait $pid2
> > echo "RT task finished"
> > **********************************************************************
> >
> > Test1
> > =====
> > Test two readers (one RT class and one BE class) and see how BW is
> > allocated with-in cgroup
> >
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> >
> > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > RT task finished
> >
> > Note: See, there is no difference in the performance of RT or BE task.
> > Looks like these got throttled equally.
>
> OK, this is coherent with the current io-throttle implementation. IO
> requests are throttled without the concept of the ioprio model.
>
> We could try to distribute the throttle using a function of each task's
> ioprio, but ok, the obvious drawback is that it totally breaks the logic
> used by the underlying layers.
>
> BTW, I'm wondering, is it a very critical issue? I would say why not to
> move the RT task to a different cgroup with unlimited BW? or limited BW
> but with other tasks running at the same IO priority...
So one hypothetical use case could be the following. Somebody is running a
hosted server, and each customer gets their applications running in a
particular cgroup with a limit on max bw.

                     root
                   /  |   \
               cust1 cust2 cust3
            (20 MB/s) (40 MB/s) (30 MB/s)

Now all three customers will run their own applications/virtual machines
in their respective groups with upper limits. Will we tell them that all
their tasks will be considered the same class and same prio level?

Assume cust1 is running a hypothetical application which creates multiple
threads and assigns these threads different priorities based on its needs
at run time. How would we handle that?

You can't collect all the RT tasks from all customers and move these to a
single cgroup, or ask customers to separate out their tasks based on
priority level and give them multiple groups of different priority.
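For illustration only, such a setup might look roughly like the following
sketch, reusing the blockio.bandwidth-max interface from the test script
above (the cust* group names and the 20/40/30 MB/s limits are hypothetical,
taken from the diagram; the device is assumed to be /dev/sdb as in the test
script):

#!/bin/bash
# Hypothetical sketch: one bandwidth-limited cgroup per hosted customer,
# using the io-throttle blockio.bandwidth-max file shown earlier.
mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/cust1 /cgroup/iot/cust2 /cgroup/iot/cust3

# Upper limits on /dev/sdb: 20, 40 and 30 MB/s respectively.
echo "/dev/sdb:$((20 * 1024 * 1024)):0:0" > /cgroup/iot/cust1/blockio.bandwidth-max
echo "/dev/sdb:$((40 * 1024 * 1024)):0:0" > /cgroup/iot/cust2/blockio.bandwidth-max
echo "/dev/sdb:$((30 * 1024 * 1024)):0:0" > /cgroup/iot/cust3/blockio.bandwidth-max

# Each customer's workload is then started from (or moved into) its group,
# e.g. for a shell belonging to cust1:
echo $$ > /cgroup/iot/cust1/tasks

The point is only that inside cust1 the application is still free to assign
different ioprio classes to its own threads; the open question is whether
those priorities survive the throttling.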
> could the cgroup
> subsystem be a more flexible and customizable framework respect to the
> current ioprio model?
>
> I'm not saying we have to ignore the problem, just trying to evaluate
> the impact and alternatives. And I'm still convinced that also providing
> per-cgroup ioprio would be an important feature.
>
> >
> >
> > Without io-throttle patches
> > ----------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> >
> > 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
> >
> > Note: Because I can't limit the BW without io-throttle patches, so don't
> > worry about increased BW. But the important point is that RT task
> > gets much more BW than a BE prio 7 task.
> >
> > Test2
> > ====
> > - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> > distributed among these.
> >
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> >
> > 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> > High prio reader finished
>
> Ditto.
>
> >
> > Without io-throttle patches
> > ---------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> >
> > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > High prio reader finished
> > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> >
> > Note: There is no service differentiation between prio 0 and prio 7 task
> > with io-throttle patches.
> >
> > Test 3
> > ======
> > - Run the one RT reader and one BE reader in root cgroup without any
> > limitations. I guess this should mean unlimited BW and behavior should
> > be same as with CFQ without io-throttling patches.
> >
> > With io-throttle patches
> > =========================
> > Ran the test 4 times because I was getting different results in different
> > runs.
> >
> > - Two readers, one RT prio 0 other BE prio 7
> >
> > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > RT task finished
> >
> > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> >
> > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> >
> > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > RT task finished
> >
> > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > and RT task finished after BE task. Rest of the two times, the
> > difference between BW of RT and BE task is much less as compared to
> > without patches. In fact once it was almost same.
>
> This is strange. If you don't set any limit there shouldn't be any
> difference respect to the other case (without io-throttle patches).
>
> At worst a small overhead given by the task_to_iothrottle(), under
> rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> reproduce this strange behaviour.
Ya, I also found this strange. At least in the root group there should not
be any behavior change (at most one might expect a little drop in throughput
because of the extra code).
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
@ 2009-05-06 21:52 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 21:52 UTC (permalink / raw)
To: Andrea Righi
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > Hi Andrea and others,
> >
> > I always had this doubt in mind that any kind of 2nd level controller will
> > have no idea about underlying IO scheduler queues/semantics. So while it
> > can implement a particular cgroup policy (max bw like io-throttle or
> > proportional bw like dm-ioband) but there are high chances that it will
> > break IO scheduler's semantics in one way or other.
> >
> > I had already sent out the results for dm-ioband in a separate thread.
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> >
> > Here are some basic results with io-throttle. Andrea, please let me know
> > if you think this is procedural problem. Playing with io-throttle patches
> > for the first time.
> >
> > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > scheduler.
> >
> > I have got one SATA drive with one partition on it.
> >
> > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > between these tasks. Following are the results.
> >
> > Following is my test script.
> >
> > *******************************************************************
> > #!/bin/bash
> >
> > mount /dev/sdb1 /mnt/sdb
> >
> > mount -t cgroup -o blockio blockio /cgroup/iot/
> > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> >
> > # Set bw limit of 8 MB/ps on sdb
> > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > /cgroup/iot/test1/blockio.bandwidth-max
> >
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > echo $$ > /cgroup/iot/test1/tasks
> >
> > # Launch a normal prio reader.
> > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > pid1=$!
> > echo $pid1
> >
> > # Launch an RT reader
> > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > pid2=$!
> > echo $pid2
> >
> > wait $pid2
> > echo "RT task finished"
> > **********************************************************************
> >
> > Test1
> > =====
> > Test two readers (one RT class and one BE class) and see how BW is
> > allocated with-in cgroup
> >
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> >
> > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > RT task finished
> >
> > Note: See, there is no difference in the performance of RT or BE task.
> > Looks like these got throttled equally.
>
> OK, this is coherent with the current io-throttle implementation. IO
> requests are throttled without the concept of the ioprio model.
>
> We could try to distribute the throttle using a function of each task's
> ioprio, but ok, the obvious drawback is that it totally breaks the logic
> used by the underlying layers.
>
> BTW, I'm wondering, is it a very critical issue? I would say why not to
> move the RT task to a different cgroup with unlimited BW? or limited BW
> but with other tasks running at the same IO priority...
So one hypothetical use case could be the following. Somebody is running a
hosted server, and each customer gets their applications running in a
particular cgroup with a limit on max bw.

                     root
                   /  |   \
               cust1 cust2 cust3
            (20 MB/s) (40 MB/s) (30 MB/s)

Now all three customers will run their own applications/virtual machines
in their respective groups with upper limits. Will we tell them that all
their tasks will be considered the same class and same prio level?

Assume cust1 is running a hypothetical application which creates multiple
threads and assigns these threads different priorities based on its needs
at run time. How would we handle that?

You can't collect all the RT tasks from all customers and move these to a
single cgroup, or ask customers to separate out their tasks based on
priority level and give them multiple groups of different priority.
> could the cgroup
> subsystem be a more flexible and customizable framework respect to the
> current ioprio model?
>
> I'm not saying we have to ignore the problem, just trying to evaluate
> the impact and alternatives. And I'm still convinced that also providing
> per-cgroup ioprio would be an important feature.
>
> >
> >
> > Without io-throttle patches
> > ----------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> >
> > 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
> >
> > Note: Because I can't limit the BW without io-throttle patches, so don't
> > worry about increased BW. But the important point is that RT task
> > gets much more BW than a BE prio 7 task.
> >
> > Test2
> > ====
> > - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> > distributed among these.
> >
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> >
> > 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> > High prio reader finished
>
> Ditto.
>
> >
> > Without io-throttle patches
> > ---------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> >
> > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > High prio reader finished
> > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> >
> > Note: There is no service differentiation between prio 0 and prio 7 task
> > with io-throttle patches.
> >
> > Test 3
> > ======
> > - Run the one RT reader and one BE reader in root cgroup without any
> > limitations. I guess this should mean unlimited BW and behavior should
> > be same as with CFQ without io-throttling patches.
> >
> > With io-throttle patches
> > =========================
> > Ran the test 4 times because I was getting different results in different
> > runs.
> >
> > - Two readers, one RT prio 0 other BE prio 7
> >
> > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > RT task finished
> >
> > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> >
> > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> >
> > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > RT task finished
> >
> > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > and RT task finished after BE task. Rest of the two times, the
> > difference between BW of RT and BE task is much less as compared to
> > without patches. In fact once it was almost same.
>
> This is strange. If you don't set any limit there shouldn't be any
> difference respect to the other case (without io-throttle patches).
>
> At worst a small overhead given by the task_to_iothrottle(), under
> rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> reproduce this strange behaviour.
Ya, I also found this strange. At least in the root group there should not
be any behavior change (at most one might expect a little drop in throughput
because of the extra code).
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 21:52 ` Vivek Goyal
(?)
@ 2009-05-06 22:35 ` Andrea Righi
2009-05-07 1:48 ` Ryo Tsuruta
2009-05-07 1:48 ` Ryo Tsuruta
-1 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 22:35 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > Hi Andrea and others,
> > >
> > > I always had this doubt in mind that any kind of 2nd level controller will
> > > have no idea about underlying IO scheduler queues/semantics. So while it
> > > can implement a particular cgroup policy (max bw like io-throttle or
> > > proportional bw like dm-ioband) but there are high chances that it will
> > > break IO scheduler's semantics in one way or other.
> > >
> > > I had already sent out the results for dm-ioband in a separate thread.
> > >
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > >
> > > Here are some basic results with io-throttle. Andrea, please let me know
> > > if you think this is procedural problem. Playing with io-throttle patches
> > > for the first time.
> > >
> > > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > > scheduler.
> > >
> > > I have got one SATA drive with one partition on it.
> > >
> > > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > > between these tasks. Following are the results.
> > >
> > > Following is my test script.
> > >
> > > *******************************************************************
> > > #!/bin/bash
> > >
> > > mount /dev/sdb1 /mnt/sdb
> > >
> > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > >
> > > # Set bw limit of 8 MB/ps on sdb
> > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > /cgroup/iot/test1/blockio.bandwidth-max
> > >
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > echo $$ > /cgroup/iot/test1/tasks
> > >
> > > # Launch a normal prio reader.
> > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > pid1=$!
> > > echo $pid1
> > >
> > > # Launch an RT reader
> > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > pid2=$!
> > > echo $pid2
> > >
> > > wait $pid2
> > > echo "RT task finished"
> > > **********************************************************************
> > >
> > > Test1
> > > =====
> > > Test two readers (one RT class and one BE class) and see how BW is
> > > allocated with-in cgroup
> > >
> > > With io-throttle patches
> > > ------------------------
> > > - Two readers, first BE prio 7, second RT prio 0
> > >
> > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > RT task finished
> > >
> > > Note: See, there is no difference in the performance of RT or BE task.
> > > Looks like these got throttled equally.
> >
> > OK, this is coherent with the current io-throttle implementation. IO
> > requests are throttled without the concept of the ioprio model.
> >
> > We could try to distribute the throttle using a function of each task's
> > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > used by the underlying layers.
> >
> > BTW, I'm wondering, is it a very critical issue? I would say why not to
> > move the RT task to a different cgroup with unlimited BW? or limited BW
> > but with other tasks running at the same IO priority...
>
> So one of hypothetical use case probably could be following. Somebody
> is having a hosted server and customers are going to get there
> applications running in a particular cgroup with a limit on max bw.
>
> root
> / | \
> cust1 cust2 cust3
> (20 MB/s) (40MB/s) (30MB/s)
>
> Now all three customers will run their own applications/virtual machines
> in their respective groups with upper limits. Will we say to these that
> all your tasks will be considered as same class and same prio level.
>
> Assume cust1 is running a hypothetical application which creates multiple
> threads and assigns these threads different priorities based on its needs
> at run time. How would we handle this thing?
>
> You can't collect all the RT tasks from all customers and move these to a
> single cgroup. Or ask customers to separate out their tasks based on
> priority level and give them multiple groups of different priority.
Clear.
Unfortunately, I think that with absolute BW limits, at a certain point,
if we hit the limit, we need to block the IO request. That is true whether
we do it when we dispatch or when we submit the request. And the risk is to
break the logic of the IO priorities and fall into the classic priority
inversion problem.

The difference is that working at the CFQ level probably gives better
control, so we can handle these cases appropriately and avoid the priority
inversion problem.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 22:35 ` Andrea Righi
@ 2009-05-07 1:48 ` Ryo Tsuruta
2009-05-07 1:48 ` Ryo Tsuruta
1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07 1:48 UTC (permalink / raw)
To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Thu, 7 May 2009 00:35:13 +0200
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > > Hi Andrea and others,
> > > >
> > > > I always had this doubt in mind that any kind of 2nd level controller will
> > > > have no idea about underlying IO scheduler queues/semantics. So while it
> > > > can implement a particular cgroup policy (max bw like io-throttle or
> > > > proportional bw like dm-ioband) but there are high chances that it will
> > > > break IO scheduler's semantics in one way or other.
> > > >
> > > > I had already sent out the results for dm-ioband in a separate thread.
> > > >
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > > >
> > > > Here are some basic results with io-throttle. Andrea, please let me know
> > > > if you think this is procedural problem. Playing with io-throttle patches
> > > > for the first time.
> > > >
> > > > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > > > scheduler.
> > > >
> > > > I have got one SATA drive with one partition on it.
> > > >
> > > > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > > > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > > > between these tasks. Following are the results.
> > > >
> > > > Following is my test script.
> > > >
> > > > *******************************************************************
> > > > #!/bin/bash
> > > >
> > > > mount /dev/sdb1 /mnt/sdb
> > > >
> > > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > > >
> > > > # Set bw limit of 8 MB/ps on sdb
> > > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > > /cgroup/iot/test1/blockio.bandwidth-max
> > > >
> > > > sync
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > >
> > > > echo $$ > /cgroup/iot/test1/tasks
> > > >
> > > > # Launch a normal prio reader.
> > > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > > pid1=$!
> > > > echo $pid1
> > > >
> > > > # Launch an RT reader
> > > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > > pid2=$!
> > > > echo $pid2
> > > >
> > > > wait $pid2
> > > > echo "RT task finished"
> > > > **********************************************************************
> > > >
> > > > Test1
> > > > =====
> > > > Test two readers (one RT class and one BE class) and see how BW is
> > > > allocated with-in cgroup
> > > >
> > > > With io-throttle patches
> > > > ------------------------
> > > > - Two readers, first BE prio 7, second RT prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > > RT task finished
> > > >
> > > > Note: See, there is no difference in the performance of RT or BE task.
> > > > Looks like these got throttled equally.
> > >
> > > OK, this is coherent with the current io-throttle implementation. IO
> > > requests are throttled without the concept of the ioprio model.
> > >
> > > We could try to distribute the throttle using a function of each task's
> > > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > > used by the underlying layers.
> > >
> > > BTW, I'm wondering, is it a very critical issue? I would say why not to
> > > move the RT task to a different cgroup with unlimited BW? or limited BW
> > > but with other tasks running at the same IO priority...
> >
> > So one of hypothetical use case probably could be following. Somebody
> > is having a hosted server and customers are going to get there
> > applications running in a particular cgroup with a limit on max bw.
> >
> > root
> > / | \
> > cust1 cust2 cust3
> > (20 MB/s) (40MB/s) (30MB/s)
> >
> > Now all three customers will run their own applications/virtual machines
> > in their respective groups with upper limits. Will we say to these that
> > all your tasks will be considered as same class and same prio level.
> >
> > Assume cust1 is running a hypothetical application which creates multiple
> > threads and assigns these threads different priorities based on its needs
> > at run time. How would we handle this thing?
> >
> > You can't collect all the RT tasks from all customers and move these to a
> > single cgroup. Or ask customers to separate out their tasks based on
> > priority level and give them multiple groups of different priority.
>
> Clear.
>
> Unfortunately, I think, with absolute BW limits at a certain point, if
> we hit the limit, we need to block the IO request. That's the same
> either, when we dispatch or submit the request. And the risk is to break
> the logic of the IO priorities and fall in the classic priority
> inversion problem.
>
> The difference is that probably working at the CFQ level gives a better
> control so we can handle these cases appropriately and avoid the
> priority inversion problems.
>
> Thanks,
> -Andrea
If RT tasks in cust1 issue IOs intensively, are the IOs issued by BE
tasks running in cust2 and cust3 suppressed, so that cust1 can use the
whole bandwidth?

I think that CFQ's classes and priorities should be preserved within the
bandwidth given to each cgroup.
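As an illustration only, this could be checked with something like the
following sketch, reusing the hypothetical cust1/cust2/cust3 groups and
limits discussed above (the /mnt/sdb/file* names are hypothetical, and each
dd is moved into its group right after it starts, so there is a small
startup race that does not matter for a long sequential read):

#!/bin/bash
# Hypothetical sketch: an RT reader in cust1 competing with BE readers in
# cust2 and cust3. If the per-group limits hold, cust2 and cust3 should
# still get roughly their configured bandwidth.
sync
echo 3 > /proc/sys/vm/drop_caches

ionice -c 1 -n 0 dd if=/mnt/sdb/file1 of=/dev/null bs=1M &
echo $! > /cgroup/iot/cust1/tasks

ionice -c 2 -n 4 dd if=/mnt/sdb/file2 of=/dev/null bs=1M &
echo $! > /cgroup/iot/cust2/tasks

ionice -c 2 -n 4 dd if=/mnt/sdb/file3 of=/dev/null bs=1M &
echo $! > /cgroup/iot/cust3/tasks

wait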
Thanks,
Ryo Tsuruta
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 22:35 ` Andrea Righi
2009-05-07 1:48 ` Ryo Tsuruta
@ 2009-05-07 1:48 ` Ryo Tsuruta
1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07 1:48 UTC (permalink / raw)
To: righi.andrea
Cc: vgoyal, akpm, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, fernando, s-uchida, taka, guijianfeng,
jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
snitzer, m-ikeda, peterz
From: Andrea Righi <righi.andrea@gmail.com>
Subject: Re: IO scheduler based IO Controller V2
Date: Thu, 7 May 2009 00:35:13 +0200
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > > Hi Andrea and others,
> > > >
> > > > I always had this doubt in mind that any kind of 2nd level controller will
> > > > have no idea about underlying IO scheduler queues/semantics. So while it
> > > > can implement a particular cgroup policy (max bw like io-throttle or
> > > > proportional bw like dm-ioband) but there are high chances that it will
> > > > break IO scheduler's semantics in one way or other.
> > > >
> > > > I had already sent out the results for dm-ioband in a separate thread.
> > > >
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > > >
> > > > Here are some basic results with io-throttle. Andrea, please let me know
> > > > if you think this is procedural problem. Playing with io-throttle patches
> > > > for the first time.
> > > >
> > > > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > > > scheduler.
> > > >
> > > > I have got one SATA drive with one partition on it.
> > > >
> > > > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > > > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > > > between these tasks. Following are the results.
> > > >
> > > > Following is my test script.
> > > >
> > > > *******************************************************************
> > > > #!/bin/bash
> > > >
> > > > mount /dev/sdb1 /mnt/sdb
> > > >
> > > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > > >
> > > > # Set bw limit of 8 MB/ps on sdb
> > > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > > /cgroup/iot/test1/blockio.bandwidth-max
> > > >
> > > > sync
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > >
> > > > echo $$ > /cgroup/iot/test1/tasks
> > > >
> > > > # Launch a normal prio reader.
> > > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > > pid1=$!
> > > > echo $pid1
> > > >
> > > > # Launch an RT reader
> > > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > > pid2=$!
> > > > echo $pid2
> > > >
> > > > wait $pid2
> > > > echo "RT task finished"
> > > > **********************************************************************
> > > >
> > > > Test1
> > > > =====
> > > > Test two readers (one RT class and one BE class) and see how BW is
> > > > allocated with-in cgroup
> > > >
> > > > With io-throttle patches
> > > > ------------------------
> > > > - Two readers, first BE prio 7, second RT prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > > RT task finished
> > > >
> > > > Note: See, there is no difference in the performance of RT or BE task.
> > > > Looks like these got throttled equally.
> > >
> > > OK, this is coherent with the current io-throttle implementation. IO
> > > requests are throttled without the concept of the ioprio model.
> > >
> > > We could try to distribute the throttle using a function of each task's
> > > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > > used by the underlying layers.
> > >
> > > BTW, I'm wondering, is it a very critical issue? I would say why not to
> > > move the RT task to a different cgroup with unlimited BW? or limited BW
> > > but with other tasks running at the same IO priority...
> >
> > So one of hypothetical use case probably could be following. Somebody
> > is having a hosted server and customers are going to get there
> > applications running in a particular cgroup with a limit on max bw.
> >
> > root
> > / | \
> > cust1 cust2 cust3
> > (20 MB/s) (40MB/s) (30MB/s)
> >
> > Now all three customers will run their own applications/virtual machines
> > in their respective groups with upper limits. Will we say to these that
> > all your tasks will be considered as same class and same prio level.
> >
> > Assume cust1 is running a hypothetical application which creates multiple
> > threads and assigns these threads different priorities based on its needs
> > at run time. How would we handle this thing?
> >
> > You can't collect all the RT tasks from all customers and move these to a
> > single cgroup. Or ask customers to separate out their tasks based on
> > priority level and give them multiple groups of different priority.
>
> Clear.
>
> Unfortunately, I think, with absolute BW limits at a certain point, if
> we hit the limit, we need to block the IO request. That's the same
> either, when we dispatch or submit the request. And the risk is to break
> the logic of the IO priorities and fall in the classic priority
> inversion problem.
>
> The difference is that probably working at the CFQ level gives a better
> control so we can handle these cases appropriately and avoid the
> priority inversion problems.
>
> Thanks,
> -Andrea
If RT tasks in cust1 issue IOs intensively, are the IOs issued by BE
tasks running in cust2 and cust3 suppressed, so that cust1 can use the
whole bandwidth?

I think that CFQ's classes and priorities should be preserved within the
bandwidth given to each cgroup.
Thanks,
Ryo Tsuruta
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090506215235.GJ8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506215235.GJ8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 22:35 ` Andrea Righi
2009-05-07 9:04 ` Andrea Righi
1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 22:35 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > Hi Andrea and others,
> > >
> > > I always had this doubt in mind that any kind of 2nd level controller will
> > > have no idea about underlying IO scheduler queues/semantics. So while it
> > > can implement a particular cgroup policy (max bw like io-throttle or
> > > proportional bw like dm-ioband) but there are high chances that it will
> > > break IO scheduler's semantics in one way or other.
> > >
> > > I had already sent out the results for dm-ioband in a separate thread.
> > >
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > >
> > > Here are some basic results with io-throttle. Andrea, please let me know
> > > if you think this is procedural problem. Playing with io-throttle patches
> > > for the first time.
> > >
> > > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > > scheduler.
> > >
> > > I have got one SATA drive with one partition on it.
> > >
> > > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > > between these tasks. Following are the results.
> > >
> > > Following is my test script.
> > >
> > > *******************************************************************
> > > #!/bin/bash
> > >
> > > mount /dev/sdb1 /mnt/sdb
> > >
> > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > >
> > > # Set bw limit of 8 MB/ps on sdb
> > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > /cgroup/iot/test1/blockio.bandwidth-max
> > >
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > echo $$ > /cgroup/iot/test1/tasks
> > >
> > > # Launch a normal prio reader.
> > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > pid1=$!
> > > echo $pid1
> > >
> > > # Launch an RT reader
> > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > pid2=$!
> > > echo $pid2
> > >
> > > wait $pid2
> > > echo "RT task finished"
> > > **********************************************************************
> > >
> > > Test1
> > > =====
> > > Test two readers (one RT class and one BE class) and see how BW is
> > > allocated with-in cgroup
> > >
> > > With io-throttle patches
> > > ------------------------
> > > - Two readers, first BE prio 7, second RT prio 0
> > >
> > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > RT task finished
> > >
> > > Note: See, there is no difference in the performance of RT or BE task.
> > > Looks like these got throttled equally.
> >
> > OK, this is coherent with the current io-throttle implementation. IO
> > requests are throttled without the concept of the ioprio model.
> >
> > We could try to distribute the throttle using a function of each task's
> > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > used by the underlying layers.
> >
> > BTW, I'm wondering, is it a very critical issue? I would say why not to
> > move the RT task to a different cgroup with unlimited BW? or limited BW
> > but with other tasks running at the same IO priority...
>
> So one of hypothetical use case probably could be following. Somebody
> is having a hosted server and customers are going to get there
> applications running in a particular cgroup with a limit on max bw.
>
> root
> / | \
> cust1 cust2 cust3
> (20 MB/s) (40MB/s) (30MB/s)
>
> Now all three customers will run their own applications/virtual machines
> in their respective groups with upper limits. Will we say to these that
> all your tasks will be considered as same class and same prio level.
>
> Assume cust1 is running a hypothetical application which creates multiple
> threads and assigns these threads different priorities based on its needs
> at run time. How would we handle this thing?
>
> You can't collect all the RT tasks from all customers and move these to a
> single cgroup. Or ask customers to separate out their tasks based on
> priority level and give them multiple groups of different priority.
Clear.
Unfortunately, I think that with absolute BW limits, at a certain point,
if we hit the limit, we need to block the IO request. That is true whether
we do it when we dispatch or when we submit the request. And the risk is to
break the logic of the IO priorities and fall into the classic priority
inversion problem.

The difference is that working at the CFQ level probably gives better
control, so we can handle these cases appropriately and avoid the priority
inversion problem.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506215235.GJ8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 22:35 ` Andrea Righi
@ 2009-05-07 9:04 ` Andrea Righi
1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 9:04 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > Without io-throttle patches
> > > ---------------------------
> > > - Two readers, first BE prio 7, second BE prio 0
> > >
> > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > High prio reader finished
> > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > >
> > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > with io-throttle patches.
> > >
> > > Test 3
> > > ======
> > > - Run the one RT reader and one BE reader in root cgroup without any
> > > limitations. I guess this should mean unlimited BW and behavior should
> > > be same as with CFQ without io-throttling patches.
> > >
> > > With io-throttle patches
> > > =========================
> > > Ran the test 4 times because I was getting different results in different
> > > runs.
> > >
> > > - Two readers, one RT prio 0 other BE prio 7
> > >
> > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > RT task finished
> > >
> > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > >
> > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > >
> > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > RT task finished
> > >
> > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > and RT task finished after BE task. Rest of the two times, the
> > > difference between BW of RT and BE task is much less as compared to
> > > without patches. In fact once it was almost same.
> >
> > This is strange. If you don't set any limit there shouldn't be any
> > difference respect to the other case (without io-throttle patches).
> >
> > At worst a small overhead given by the task_to_iothrottle(), under
> > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > reproduce this strange behaviour.
>
> Ya, I also found this strange. At least in root group there should not be
> any behavior change (at max one might expect little drop in throughput
> because of extra code).
Hi Vivek,
I'm not able to reproduce the strange behaviour above.
Which commands are you running exactly? Is the system isolated (stupid
question), with no cron or background tasks doing IO during the tests?

Following is the script I've used:
$ cat test.sh
#!/bin/sh
echo 3 > /proc/sys/vm/drop_caches
ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
for i in 1 2; do
wait
done
And the results on my PC:
2.6.30-rc4
~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s
2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
RT: 4:blockio:/
The difference seems to be just the expected overhead.
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 21:52 ` Vivek Goyal
` (2 preceding siblings ...)
(?)
@ 2009-05-07 9:04 ` Andrea Righi
2009-05-07 12:22 ` Andrea Righi
` (3 more replies)
-1 siblings, 4 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 9:04 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > Without io-throttle patches
> > > ---------------------------
> > > - Two readers, first BE prio 7, second BE prio 0
> > >
> > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > High prio reader finished
> > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > >
> > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > with io-throttle patches.
> > >
> > > Test 3
> > > ======
> > > - Run the one RT reader and one BE reader in root cgroup without any
> > > limitations. I guess this should mean unlimited BW and behavior should
> > > be same as with CFQ without io-throttling patches.
> > >
> > > With io-throttle patches
> > > =========================
> > > Ran the test 4 times because I was getting different results in different
> > > runs.
> > >
> > > - Two readers, one RT prio 0 other BE prio 7
> > >
> > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > RT task finished
> > >
> > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > >
> > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > >
> > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > RT task finished
> > >
> > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > and RT task finished after BE task. Rest of the two times, the
> > > difference between BW of RT and BE task is much less as compared to
> > > without patches. In fact once it was almost same.
> >
> > This is strange. If you don't set any limit there shouldn't be any
> > difference respect to the other case (without io-throttle patches).
> >
> > At worst a small overhead given by the task_to_iothrottle(), under
> > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > reproduce this strange behaviour.
>
> Ya, I also found this strange. At least in root group there should not be
> any behavior change (at max one might expect little drop in throughput
> because of extra code).
Hi Vivek,
I'm not able to reproduce the strange behaviour above.
Which commands are you running exactly? Is the system isolated (stupid
question), with no cron or background tasks doing IO during the tests?

Following is the script I've used:
$ cat test.sh
#!/bin/sh
echo 3 > /proc/sys/vm/drop_caches
ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
for i in 1 2; do
wait
done
And the results on my PC:
2.6.30-rc4
~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s
2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
RT: 4:blockio:/
The difference seems to be just the expected overhead.
-Andrea
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-07 9:04 ` Andrea Righi
@ 2009-05-07 12:22 ` Andrea Righi
2009-05-07 12:22 ` Andrea Righi
` (2 subsequent siblings)
3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 12:22 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > >
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > > with io-throttle patches.
> > > >
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > > limitations. I guess this should mean unlimited BW and behavior should
> > > > be same as with CFQ without io-throttling patches.
> > > >
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > >
> > > > - Two readers, one RT prio 0 other BE prio 7
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > >
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > > and RT task finished after BE task. Rest of the two times, the
> > > > difference between BW of RT and BE task is much less as compared to
> > > > without patches. In fact once it was almost same.
> > >
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > >
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> >
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
>
> Hi Vivek,
>
> I'm not able to reproduce the strange behaviour above.
>
> Which commands are you running exactly? is the system isolated (stupid
> question) no cron or background tasks doing IO during the tests?
>
> Following the script I've used:
>
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> wait
> done
>
> And the results on my PC:
>
> 2.6.30-rc4
> ~~~~~~~~~~
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s
>
> 2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
> RT: 4:blockio:/
>
> The difference seems to be just the expected overhead.
BTW, it is possible to reduce the io-throttle overhead even more for
non-io-throttle users (also when CONFIG_CGROUP_IO_THROTTLE is enabled)
using the trick below.
2.6.30-rc4 + io-throttle + following patch, no BW limit, tasks in root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 17.462 s, 14.1 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.7865 s, 20.8 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.8375 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9148 s, 20.6 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 19.6826 s, 12.5 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8715 s, 20.7 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.9152 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8925 s, 20.6 MB/s
RT: 4:blockio:/
[ To be applied on top of io-throttle v16 ]
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/blk-io-throttle.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index e2dfd24..8b45c71 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -131,6 +131,14 @@ struct iothrottle_node {
struct iothrottle_stat stat;
};
+/*
+ * This is a trick to reduce the unneeded overhead when io-throttle is not used
+ * at all. We use a counter of the io-throttle rules; if the counter is zero,
+ * we immediately return from the io-throttle hooks, without accounting IO and
+ * without checking if we need to apply some limiting rules.
+ */
+static atomic_t iothrottle_node_count __read_mostly;
+
/**
* struct iothrottle - throttling rules for a cgroup
* @css: pointer to the cgroup state
@@ -193,6 +201,7 @@ static void iothrottle_insert_node(struct iothrottle *iot,
{
WARN_ON_ONCE(!cgroup_is_locked());
list_add_rcu(&n->node, &iot->list);
+ atomic_inc(&iothrottle_node_count);
}
/*
@@ -214,6 +223,7 @@ iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
{
WARN_ON_ONCE(!cgroup_is_locked());
list_del_rcu(&n->node);
+ atomic_dec(&iothrottle_node_count);
}
/*
@@ -250,8 +260,10 @@ static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
* reference to the list.
*/
if (!list_empty(&iot->list))
- list_for_each_entry_safe(n, p, &iot->list, node)
+ list_for_each_entry_safe(n, p, &iot->list, node) {
kfree(n);
+ atomic_dec(&iothrottle_node_count);
+ }
kfree(iot);
}
@@ -836,7 +848,7 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
unsigned long long sleep;
int type, can_sleep = 1;
- if (iothrottle_disabled())
+ if (iothrottle_disabled() || !atomic_read(&iothrottle_node_count))
return 0;
if (unlikely(!bdev))
return 0;
* Re: IO scheduler based IO Controller V2
2009-05-07 9:04 ` Andrea Righi
2009-05-07 12:22 ` Andrea Righi
@ 2009-05-07 12:22 ` Andrea Righi
2009-05-07 14:11 ` Vivek Goyal
2009-05-07 14:11 ` Vivek Goyal
3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 12:22 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > >
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > > with io-throttle patches.
> > > >
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > > limitations. I guess this should mean unlimited BW and behavior should
> > > > be same as with CFQ without io-throttling patches.
> > > >
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > >
> > > > - Two readers, one RT prio 0 other BE prio 7
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > >
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > > and RT task finished after BE task. Rest of the two times, the
> > > > difference between BW of RT and BE task is much less as compared to
> > > > without patches. In fact once it was almost same.
> > >
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > >
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> >
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
>
> Hi Vivek,
>
> I'm not able to reproduce the strange behaviour above.
>
> Which commands are you running exactly? is the system isolated (stupid
> question) no cron or background tasks doing IO during the tests?
>
> Following the script I've used:
>
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> wait
> done
>
> And the results on my PC:
>
> 2.6.30-rc4
> ~~~~~~~~~~
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s
>
> 2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
> RT: 4:blockio:/
>
> The difference seems to be just the expected overhead.
BTW, it is possible to reduce the io-throttle overhead even further for
non-io-throttle users (even when CONFIG_CGROUP_IO_THROTTLE is enabled)
using the trick below.
2.6.30-rc4 + io-throttle + following patch, no BW limit, tasks in root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 17.462 s, 14.1 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.7865 s, 20.8 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.8375 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9148 s, 20.6 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 19.6826 s, 12.5 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8715 s, 20.7 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.9152 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8925 s, 20.6 MB/s
RT: 4:blockio:/
[ To be applied on top of io-throttle v16 ]
Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
block/blk-io-throttle.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index e2dfd24..8b45c71 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -131,6 +131,14 @@ struct iothrottle_node {
struct iothrottle_stat stat;
};
+/*
+ * This is a trick to reduce the unneeded overhead when io-throttle is not used
+ * at all. We use a counter of the io-throttle rules; if the counter is zero,
+ * we immediately return from the io-throttle hooks, without accounting IO and
+ * without checking if we need to apply some limiting rules.
+ */
+static atomic_t iothrottle_node_count __read_mostly;
+
/**
* struct iothrottle - throttling rules for a cgroup
* @css: pointer to the cgroup state
@@ -193,6 +201,7 @@ static void iothrottle_insert_node(struct iothrottle *iot,
{
WARN_ON_ONCE(!cgroup_is_locked());
list_add_rcu(&n->node, &iot->list);
+ atomic_inc(&iothrottle_node_count);
}
/*
@@ -214,6 +223,7 @@ iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
{
WARN_ON_ONCE(!cgroup_is_locked());
list_del_rcu(&n->node);
+ atomic_dec(&iothrottle_node_count);
}
/*
@@ -250,8 +260,10 @@ static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
* reference to the list.
*/
if (!list_empty(&iot->list))
- list_for_each_entry_safe(n, p, &iot->list, node)
+ list_for_each_entry_safe(n, p, &iot->list, node) {
kfree(n);
+ atomic_dec(&iothrottle_node_count);
+ }
kfree(iot);
}
@@ -836,7 +848,7 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
unsigned long long sleep;
int type, can_sleep = 1;
- if (iothrottle_disabled())
+ if (iothrottle_disabled() || !atomic_read(&iothrottle_node_count))
return 0;
if (unlikely(!bdev))
return 0;
* Re: IO scheduler based IO Controller V2
2009-05-07 9:04 ` Andrea Righi
2009-05-07 12:22 ` Andrea Righi
2009-05-07 12:22 ` Andrea Righi
@ 2009-05-07 14:11 ` Vivek Goyal
2009-05-07 14:11 ` Vivek Goyal
3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 14:11 UTC (permalink / raw)
To: Andrea Righi
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton
On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > >
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > > with io-throttle patches.
> > > >
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > > limitations. I guess this should mean unlimited BW and behavior should
> > > > be same as with CFQ without io-throttling patches.
> > > >
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > >
> > > > - Two readers, one RT prio 0 other BE prio 7
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > >
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > > and RT task finished after BE task. Rest of the two times, the
> > > > difference between BW of RT and BE task is much less as compared to
> > > > without patches. In fact once it was almost same.
> > >
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > >
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> >
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
>
> Hi Vivek,
>
> I'm not able to reproduce the strange behaviour above.
>
> Which commands are you running exactly? is the system isolated (stupid
> question) no cron or background tasks doing IO during the tests?
>
> Following the script I've used:
>
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> wait
> done
>
> And the results on my PC:
>
[..]
> The difference seems to be just the expected overhead.
Hmm, something is really amiss here. I took your script and ran it on
my system and I still see the issue. There is nothing else running on the
system and it is isolated.
2.6.30-rc4 + io-throttle patches V16
===================================
It is a freshly booted system with nothing extra running on it. This is a
4-core system.
Disk1
=====
This is a fast disk which supports a queue depth of 31.
Following is the output picked from dmesg for this device's properties.
[ 3.016099] sd 2:0:0:0: [sdb] 488397168 512-byte hardware sectors: (250
GB/232 GiB)
[ 3.016188] sd 2:0:0:0: Attached scsi generic sg2 type 0
Following are the results of 4 runs of your script. (I just changed the
script to read the right file on my system: if=/mnt/sdb/zerofile1).
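For reference, the adapted script differs from yours only in the input
paths; this is a sketch of what was run (the second file name below,
/mnt/sdb/zerofile2, is just a placeholder for another big file on the
same disk):

#!/bin/sh
echo 3 > /proc/sys/vm/drop_caches
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
for i in 1 2; do
        wait
done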
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 4.38435 s, 53.4 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.20706 s, 45.0 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.12953 s, 45.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.23573 s, 44.7 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 3.54644 s, 66.0 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.19406 s, 45.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.21908 s, 44.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.23802 s, 44.7 MB/s
Disk2
=====
This is a relatively slower disk with no command queuing.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.06471 s, 33.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.01571 s, 29.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.89043 s, 29.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.03428 s, 29.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.38942 s, 31.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.01146 s, 29.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.78351 s, 30.1 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.06292 s, 29.0 MB/s
Disk3
=====
This is an Intel SSD.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.993735 s, 236 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98772 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.8616 s, 126 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98499 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.01174 s, 231 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.99143 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.96132 s, 119 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.97746 s, 118 MB/s
Results without io-throttle patches (vanilla 2.6.30-rc4)
========================================================
Disk 1
======
This is the relatively faster SATA drive with command queuing enabled.
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.84065 s, 82.4 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.30087 s, 44.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69688 s, 86.8 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.18175 s, 45.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.73279 s, 85.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.21803 s, 44.9 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69304 s, 87.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.17821 s, 45.2 MB/s
Disk 2
======
Slower disk with no command queuing.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.29453 s, 54.5 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.04978 s, 29.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.96924 s, 59.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.74984 s, 30.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.11254 s, 56.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.8678 s, 29.8 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.95979 s, 59.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.73976 s, 30.3 MB/s
Disk3
=====
Intel SSD
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.996762 s, 235 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93268 s, 121 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.98511 s, 238 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.92481 s, 122 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.986981 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.9312 s, 121 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
So I am still seeing the issue with different kinds of disks as well. At
this point I am really not sure why I am seeing such results.
I have the following patches applied on 2.6.30-rc4 (V16).
3954-vivek.goyal2008-res_counter-introduce-ratelimiting-attributes.patch
3955-vivek.goyal2008-page_cgroup-provide-a-generic-page-tracking-infrastructure.patch
3956-vivek.goyal2008-io-throttle-controller-infrastructure.patch
3957-vivek.goyal2008-kiothrottled-throttle-buffered-io.patch
3958-vivek.goyal2008-io-throttle-instrumentation.patch
3959-vivek.goyal2008-io-throttle-export-per-task-statistics-to-userspace.patch
Thanks
Vivek
* Re: IO scheduler based IO Controller V2
2009-05-07 9:04 ` Andrea Righi
` (2 preceding siblings ...)
2009-05-07 14:11 ` Vivek Goyal
@ 2009-05-07 14:11 ` Vivek Goyal
[not found] ` <20090507141126.GA9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
3 siblings, 1 reply; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 14:11 UTC (permalink / raw)
To: Andrea Righi
Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
agk, dm-devel, snitzer, m-ikeda, peterz
On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > >
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > > with io-throttle patches.
> > > >
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > > limitations. I guess this should mean unlimited BW and behavior should
> > > > be same as with CFQ without io-throttling patches.
> > > >
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > >
> > > > - Two readers, one RT prio 0 other BE prio 7
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > >
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > >
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > >
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > > and RT task finished after BE task. Rest of the two times, the
> > > > difference between BW of RT and BE task is much less as compared to
> > > > without patches. In fact once it was almost same.
> > >
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > >
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> >
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
>
> Hi Vivek,
>
> I'm not able to reproduce the strange behaviour above.
>
> Which commands are you running exactly? is the system isolated (stupid
> question) no cron or background tasks doing IO during the tests?
>
> Following the script I've used:
>
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> wait
> done
>
> And the results on my PC:
>
[..]
> The difference seems to be just the expected overhead.
Hmm, something is really amiss here. I took your script and ran it on
my system and I still see the issue. There is nothing else running on the
system and it is isolated.
2.6.30-rc4 + io-throttle patches V16
===================================
It is a freshly booted system with nothing extra running on it. This is a
4-core system.
Disk1
=====
This is a fast disk which supports a queue depth of 31.
Following is the output picked from dmesg for this device's properties.
[ 3.016099] sd 2:0:0:0: [sdb] 488397168 512-byte hardware sectors: (250
GB/232 GiB)
[ 3.016188] sd 2:0:0:0: Attached scsi generic sg2 type 0
Following are the results of 4 runs of your script. (I just changed the
script to read the right file on my system: if=/mnt/sdb/zerofile1).
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 4.38435 s, 53.4 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.20706 s, 45.0 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.12953 s, 45.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.23573 s, 44.7 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 3.54644 s, 66.0 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.19406 s, 45.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.21908 s, 44.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.23802 s, 44.7 MB/s
Disk2
=====
This is a relatively slower disk with no command queuing.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.06471 s, 33.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.01571 s, 29.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.89043 s, 29.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.03428 s, 29.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.38942 s, 31.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.01146 s, 29.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.78351 s, 30.1 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.06292 s, 29.0 MB/s
Disk3
=====
This is an Intel SSD.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.993735 s, 236 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98772 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.8616 s, 126 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98499 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.01174 s, 231 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.99143 s, 118 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.96132 s, 119 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.97746 s, 118 MB/s
Results without io-throttle patches (vanilla 2.6.30-rc4)
========================================================
Disk 1
======
This is the relatively faster SATA drive with command queuing enabled.
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.84065 s, 82.4 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.30087 s, 44.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69688 s, 86.8 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.18175 s, 45.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.73279 s, 85.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.21803 s, 44.9 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69304 s, 87.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.17821 s, 45.2 MB/s
Disk 2
======
Slower disk with no command queuing.
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.29453 s, 54.5 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.04978 s, 29.1 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.96924 s, 59.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.74984 s, 30.2 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.11254 s, 56.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.8678 s, 29.8 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.95979 s, 59.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.73976 s, 30.3 MB/s
Disk3
=====
Intel SSD
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.996762 s, 235 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93268 s, 121 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.98511 s, 238 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.92481 s, 122 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.986981 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.9312 s, 121 MB/s
[root@chilli io-throttle-tests]# ./andrea-test-script.sh
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
So I am still seeing the issue with different kinds of disks as well. At
this point I am really not sure why I am seeing such results.
I have the following patches applied on 2.6.30-rc4 (V16).
3954-vivek.goyal2008-res_counter-introduce-ratelimiting-attributes.patch
3955-vivek.goyal2008-page_cgroup-provide-a-generic-page-tracking-infrastructure.patch
3956-vivek.goyal2008-io-throttle-controller-infrastructure.patch
3957-vivek.goyal2008-kiothrottled-throttle-buffered-io.patch
3958-vivek.goyal2008-io-throttle-instrumentation.patch
3959-vivek.goyal2008-io-throttle-export-per-task-statistics-to-userspace.patch
Thanks
Vivek
* Re: IO scheduler based IO Controller V2
2009-05-06 2:33 ` Vivek Goyal
` (3 preceding siblings ...)
2009-05-06 20:32 ` Vivek Goyal
@ 2009-05-07 0:18 ` Ryo Tsuruta
[not found] ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-08 14:24 ` Rik van Riel
4 siblings, 2 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07 0:18 UTC (permalink / raw)
To: vgoyal
Cc: akpm, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, peterz
Hi Vivek,
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
I'd like to avoid complicating the existing IO schedulers and other
kernel code, and to give users a choice of whether or not to use it.
I know that you chose an approach of using compile time options to
get the same behavior as the old system, but device-mapper drivers can
be added, removed and replaced while the system is running.
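For example (a rough sketch only, using the generic dm "linear" target as
a stand-in since the dm-ioband table syntax is not shown here; device
names are examples), a device-mapper device can be set up and torn down
entirely at runtime:

# create a dm device on top of /dev/sdb1 while the system is running
echo "0 $(blockdev --getsz /dev/sdb1) linear /dev/sdb1 0" | dmsetup create band0
# ... IO now goes through /dev/mapper/band0 ...
# and remove it again, also without rebooting
dmsetup remove band0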
Thanks,
Ryo Tsuruta
[parent not found: <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
2009-05-07 0:18 ` Ryo Tsuruta
@ 2009-05-07 1:25 ` Vivek Goyal
2009-05-08 14:24 ` Rik van Riel
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 1:25 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
>
> I'd like to avoid making complicated existing IO schedulers and other
> kernel codes and to give a choice to users whether or not to use it.
> I know that you chose an approach that using compile time options to
> get the same behavior as old system, but device-mapper drivers can be
> added, removed and replaced while system is running.
>
The same is possible with the IO scheduler based controller. If you don't
want the cgroup stuff, don't create any cgroups. By default everything
will be in the root group and you will get the old behavior.
If you want the io controller functionality, just create a cgroup, assign
it a weight and move the tasks there, as sketched below. So what more
choices do you want that are missing here?
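Roughly like this (a minimal sketch only; the subsystem name "io" and the
io.weight file name are assumptions about this patchset's cgroup
interface and may differ in the posted patches):

mkdir -p /cgroup
mount -t cgroup -o io none /cgroup
# create a group, give it a weight, and move the current shell into it
mkdir /cgroup/group1
echo 500 > /cgroup/group1/io.weight
echo $$ > /cgroup/group1/tasks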
Thanks
Vivek
* Re: IO scheduler based IO Controller V2
@ 2009-05-07 1:25 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 1:25 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: akpm, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, peterz
On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
>
> I'd like to avoid making complicated existing IO schedulers and other
> kernel codes and to give a choice to users whether or not to use it.
> I know that you chose an approach that using compile time options to
> get the same behavior as old system, but device-mapper drivers can be
> added, removed and replaced while system is running.
>
The same is possible with the IO scheduler based controller. If you don't
want the cgroup stuff, don't create any cgroups. By default everything
will be in the root group and you will get the old behavior.
If you want the io controller functionality, just create a cgroup, assign
it a weight and move the tasks there. So what more choices do you want
that are missing here?
Thanks
Vivek
[parent not found: <20090507012559.GC4187-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090507012559.GC4187-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 11:23 ` Ryo Tsuruta
0 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 11:23 UTC (permalink / raw)
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Wed, 6 May 2009 21:25:59 -0400
> On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > > fairness in terms of actual IO done and that would mean a seeky workload
> > > will can use disk for much longer to get equivalent IO done and slow down
> > > other applications. Implementing IO controller at IO scheduler level gives
> > > us tigher control. Will it not meet your requirements? If you got specific
> > > concerns with IO scheduler based contol patches, please highlight these and
> > > we will see how these can be addressed.
> >
> > I'd like to avoid making complicated existing IO schedulers and other
> > kernel codes and to give a choice to users whether or not to use it.
> > I know that you chose an approach that using compile time options to
> > get the same behavior as old system, but device-mapper drivers can be
> > added, removed and replaced while system is running.
> >
>
> Same is possible with IO scheduler based controller. If you don't want
> cgroup stuff, don't create those. By default everything will be in root
> group and you will get the old behavior.
>
> If you want io controller stuff, just create the cgroup, assign weight
> and move task there. So what more choices do you want which are missing
> here?
What I mean to say is that device-mapper drivers can be completely
removed from the kernel if not used.
I know that dm-ioband has some issues which can be addressed by your
IO controller, but I'm not sure your controller works well. So I would
like to see some benchmark results of your IO controller.
Thanks,
Ryo Tsuruta
* Re: IO scheduler based IO Controller V2
2009-05-07 1:25 ` Vivek Goyal
(?)
(?)
@ 2009-05-11 11:23 ` Ryo Tsuruta
[not found] ` <20090511.202309.112614168.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
-1 siblings, 1 reply; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 11:23 UTC (permalink / raw)
To: vgoyal
Cc: akpm, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, fernando, s-uchida, taka, guijianfeng, jmoyer,
dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
dm-devel, snitzer, m-ikeda, peterz
Hi Vivek,
From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: IO scheduler based IO Controller V2
Date: Wed, 6 May 2009 21:25:59 -0400
> On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > > fairness in terms of actual IO done and that would mean a seeky workload
> > > will can use disk for much longer to get equivalent IO done and slow down
> > > other applications. Implementing IO controller at IO scheduler level gives
> > > us tigher control. Will it not meet your requirements? If you got specific
> > > concerns with IO scheduler based contol patches, please highlight these and
> > > we will see how these can be addressed.
> >
> > I'd like to avoid making complicated existing IO schedulers and other
> > kernel codes and to give a choice to users whether or not to use it.
> > I know that you chose an approach that using compile time options to
> > get the same behavior as old system, but device-mapper drivers can be
> > added, removed and replaced while system is running.
> >
>
> Same is possible with IO scheduler based controller. If you don't want
> cgroup stuff, don't create those. By default everything will be in root
> group and you will get the old behavior.
>
> If you want io controller stuff, just create the cgroup, assign weight
> and move task there. So what more choices do you want which are missing
> here?
What I mean to say is that device-mapper drivers can be completely
removed from the kernel if not used.
I know that dm-ioband has some issues which can be addressed by your
IO controller, but I'm not sure your controller works well. So I would
like to see some benchmark results of your IO controller.
Thanks,
Ryo Tsuruta
* Re: IO scheduler based IO Controller V2
[not found] ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-07 1:25 ` Vivek Goyal
@ 2009-05-08 14:24 ` Rik van Riel
1 sibling, 0 replies; 297+ messages in thread
From: Rik van Riel @ 2009-05-08 14:24 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Ryo Tsuruta wrote:
> Hi Vivek,
>
>> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
>> of FIFO dispatch of buffered bios. Apart from that it tries to provide
>> fairness in terms of actual IO done and that would mean a seeky workload
>> will can use disk for much longer to get equivalent IO done and slow down
>> other applications. Implementing IO controller at IO scheduler level gives
>> us tigher control. Will it not meet your requirements? If you got specific
>> concerns with IO scheduler based contol patches, please highlight these and
>> we will see how these can be addressed.
>
> I'd like to avoid making complicated existing IO schedulers and other
> kernel codes and to give a choice to users whether or not to use it.
> I know that you chose an approach that using compile time options to
> get the same behavior as old system, but device-mapper drivers can be
> added, removed and replaced while system is running.
I do not believe that every use of cgroups will end up with
a separate logical volume for each group.
In fact, if you look at group-per-UID usage, which could be
quite common on shared web servers and shell servers, I would
expect all the groups to share the same filesystem.
I do not believe dm-ioband would be useful in that configuration,
while the IO scheduler based IO controller will just work.
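To make that concrete, a group-per-UID policy on one shared filesystem
could be as simple as the following sketch (the mount point and the
io.weight file name are assumptions about the cgroup interface):

for uid in $(cut -d: -f3 /etc/passwd); do
        mkdir -p /cgroup/uid-$uid
        echo 100 > /cgroup/uid-$uid/io.weight
done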
--
All rights reversed.
* Re: IO scheduler based IO Controller V2
2009-05-07 0:18 ` Ryo Tsuruta
[not found] ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-05-08 14:24 ` Rik van Riel
[not found] ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-11 10:11 ` Ryo Tsuruta
1 sibling, 2 replies; 297+ messages in thread
From: Rik van Riel @ 2009-05-08 14:24 UTC (permalink / raw)
To: Ryo Tsuruta
Cc: vgoyal, akpm, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, fernando, s-uchida, taka, guijianfeng,
jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
agk, dm-devel, snitzer, m-ikeda, peterz
Ryo Tsuruta wrote:
> Hi Vivek,
>
>> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
>> of FIFO dispatch of buffered bios. Apart from that it tries to provide
>> fairness in terms of actual IO done and that would mean a seeky workload
>> will can use disk for much longer to get equivalent IO done and slow down
>> other applications. Implementing IO controller at IO scheduler level gives
>> us tigher control. Will it not meet your requirements? If you got specific
>> concerns with IO scheduler based contol patches, please highlight these and
>> we will see how these can be addressed.
>
> I'd like to avoid making complicated existing IO schedulers and other
> kernel codes and to give a choice to users whether or not to use it.
> I know that you chose an approach that using compile time options to
> get the same behavior as old system, but device-mapper drivers can be
> added, removed and replaced while system is running.
I do not believe that every use of cgroups will end up with
a separate logical volume for each group.
In fact, if you look at group-per-UID usage, which could be
quite common on shared web servers and shell servers, I would
expect all the groups to share the same filesystem.
I do not believe dm-ioband would be useful in that configuration,
while the IO scheduler based IO controller will just work.
--
All rights reversed.
[parent not found: <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 10:11 ` Ryo Tsuruta
0 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 10:11 UTC (permalink / raw)
To: riel-H+wXaHxf7aLQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Hi Rik,
From: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Fri, 08 May 2009 10:24:50 -0400
> Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> >> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> >> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> >> fairness in terms of actual IO done and that would mean a seeky workload
> >> will can use disk for much longer to get equivalent IO done and slow down
> >> other applications. Implementing IO controller at IO scheduler level gives
> >> us tigher control. Will it not meet your requirements? If you got specific
> >> concerns with IO scheduler based contol patches, please highlight these and
> >> we will see how these can be addressed.
> > I'd like to avoid making complicated existing IO schedulers and other
> > kernel codes and to give a choice to users whether or not to use it.
> > I know that you chose an approach that using compile time options to
> > get the same behavior as old system, but device-mapper drivers can be
> > added, removed and replaced while system is running.
>
> I do not believe that every use of cgroups will end up with
> a separate logical volume for each group.
>
> In fact, if you look at group-per-UID usage, which could be
> quite common on shared web servers and shell servers, I would
> expect all the groups to share the same filesystem.
>
> I do not believe dm-ioband would be useful in that configuration,
> while the IO scheduler based IO controller will just work.
dm-ioband can control bandwidth on a per-cgroup basis just the same as
Vivek's IO controller. Could you explain what you want to do and how you
would configure the IO scheduler based IO controller in that case?
Thanks,
Ryo Tsuruta
* Re: IO scheduler based IO Controller V2
2009-05-08 14:24 ` Rik van Riel
[not found] ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 10:11 ` Ryo Tsuruta
1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 10:11 UTC (permalink / raw)
To: riel
Cc: vgoyal, akpm, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, fernando, s-uchida, taka, guijianfeng,
jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
agk, dm-devel, snitzer, m-ikeda, peterz
Hi Rik,
From: Rik van Riel <riel@redhat.com>
Subject: Re: IO scheduler based IO Controller V2
Date: Fri, 08 May 2009 10:24:50 -0400
> Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> >> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> >> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> >> fairness in terms of actual IO done and that would mean a seeky workload
> >> will can use disk for much longer to get equivalent IO done and slow down
> >> other applications. Implementing IO controller at IO scheduler level gives
> >> us tigher control. Will it not meet your requirements? If you got specific
> >> concerns with IO scheduler based contol patches, please highlight these and
> >> we will see how these can be addressed.
> > I'd like to avoid making complicated existing IO schedulers and other
> > kernel codes and to give a choice to users whether or not to use it.
> > I know that you chose an approach that using compile time options to
> > get the same behavior as old system, but device-mapper drivers can be
> > added, removed and replaced while system is running.
>
> I do not believe that every use of cgroups will end up with
> a separate logical volume for each group.
>
> In fact, if you look at group-per-UID usage, which could be
> quite common on shared web servers and shell servers, I would
> expect all the groups to share the same filesystem.
>
> I do not believe dm-ioband would be useful in that configuration,
> while the IO scheduler based IO controller will just work.
dm-ioband can control bandwidth on a per-cgroup basis just the same as
Vivek's IO controller. Could you explain what you want to do and how you
would configure the IO scheduler based IO controller in that case?
Thanks,
Ryo Tsuruta
[parent not found: <20090505132441.1705bfad.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090505132441.1705bfad.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-05-05 22:20 ` Peter Zijlstra
2009-05-06 2:33 ` Vivek Goyal
2009-05-06 3:41 ` Balbir Singh
2 siblings, 0 replies; 297+ messages in thread
From: Peter Zijlstra @ 2009-05-05 22:20 UTC (permalink / raw)
To: Andrew Morton
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
> I tend to think that a cgroup-based controller is the way to go.
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.
FWIW I subscribe to the io-scheduler faith as opposed to the
device-mapper cult ;-)
Also, I don't think a simple throttle will be very useful; a more mature
solution should cater to more use cases.
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090505132441.1705bfad.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-05-05 22:20 ` Peter Zijlstra
@ 2009-05-06 2:33 ` Vivek Goyal
2009-05-06 3:41 ` Balbir Singh
2 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 2:33 UTC (permalink / raw)
To: Andrew Morton
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
> I tend to think that a cgroup-based controller is the way to go.
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.
Hi Andrew,
Sorry, I did not get what you mean by a cgroup-based controller. If you
mean that we use cgroups for grouping tasks for controlling IO, then both
the IO scheduler based controller as well as the io-throttling proposal do that.
dm-ioband also supports that to some extent, but it requires the extra step of
transferring the cgroup grouping information to the dm-ioband device using dm-tools.
But if you meant the io-throttle patches, then I think they solve only
part of the problem, namely max bw control. They do not offer the minimum
BW/minimum disk share guarantees offered by proportional BW control.
IOW, io-throttle supports upper limit control but not a work-conserving
IO controller which lets a group use the whole BW if competing groups are
not present. IMHO, proportional BW control is an important feature which
we will need, and IIUC the io-throttle patches can't easily be extended to
support proportional BW control. OTOH, one should be able to extend the IO
scheduler based proportional weight controller to also support max bw control.
Andrea, last time you were planning to have a look at my patches and see
if a max bw controller can be implemented there. My feeling is that it
should not be too difficult. We already have the hierarchical tree of io
queues and groups in the elevator layer, and we run the BFQ (WF2Q+)
algorithm to select the next queue to dispatch IO from. It is just a matter
of also keeping track of the IO rate per queue/group, and we should easily
be able to delay the dispatch of IO from a queue if its group has crossed
the specified max bw.
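Roughly, something like the following check at queue-selection time (a sketch
only, to illustrate the idea; struct io_group is from this patchset, but the
max_bw, rate_sectors and rate_start fields are made-up names and not part of
the posted patches):

/*
 * Sketch: has this group dispatched more sectors in the current window
 * than a (hypothetical) max_bw in sectors/sec would allow? If yes, the
 * elevator could skip/delay this group instead of dispatching from it.
 */
static int iog_exceeds_max_bw(struct io_group *iog)
{
	unsigned long elapsed = jiffies - iog->rate_start;
	u64 allowed;

	if (!iog->max_bw || !elapsed)
		return 0;

	/* sectors allowed so far: max_bw * elapsed / HZ */
	allowed = (u64)iog->max_bw * elapsed;
	do_div(allowed, HZ);

	return iog->rate_sectors > allowed;
}

The queue selection logic could then treat such a group like an idling queue
and come back to it after a short delay instead of dispatching from it
immediately.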
This should lead to less code and reduced complexity (compared with the
case where we do max bw control with io-throttling patches and proportional
BW control using IO scheduler based control patches).
So do you think that it would make sense to do max BW control along with
the proportional weight IO controller at the IO scheduler level? If yes, then we can
work together and continue to develop this patchset to also support max
bw control and meet your requirements and drop the io-throttling patches.
The only thing which concerns me is the fact that the IO scheduler does not
have a view of the higher level logical device. So if somebody has set up a
software RAID and wants to put a max BW limit on the software RAID device, this
solution will not work. One would have to live with max bw limits on the
individual disks (where the io scheduler is actually running). Do your patches
allow putting a limit on software RAID devices as well?
Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
of FIFO dispatch of buffered bios. Apart from that it tries to provide
fairness in terms of actual IO done, and that would mean a seeky workload
can use the disk for much longer to get equivalent IO done and slow down
other applications. Implementing the IO controller at the IO scheduler level gives
us tighter control. Will it not meet your requirements? If you have specific
concerns with the IO scheduler based control patches, please highlight these and
we will see how these can be addressed.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <20090505132441.1705bfad.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-05-05 22:20 ` Peter Zijlstra
2009-05-06 2:33 ` Vivek Goyal
@ 2009-05-06 3:41 ` Balbir Singh
2 siblings, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 3:41 UTC (permalink / raw)
To: Andrew Morton
Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
agk-H+wXaHxf7aLQT0dZR+AlfA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
* Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> [2009-05-05 13:24:41]:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
We are planning an IO mini-summit prior to the kernel summit
(hopefully we'll all be able to attend and decide).
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 20:24 ` Andrew Morton
` (3 preceding siblings ...)
(?)
@ 2009-05-06 3:41 ` Balbir Singh
2009-05-06 13:28 ` Vivek Goyal
[not found] ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
-1 siblings, 2 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 3:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Vivek Goyal, dhaval, snitzer, dm-devel, jens.axboe, agk,
paolo.valente, fernando, jmoyer, fchecconi, containers,
linux-kernel, righi.andrea
* Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:
> On Tue, 5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> >
> > Hi All,
> >
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> >
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
>
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>
> Seriously, how are we to resolve this? We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>
We are planning an IO mini-summit prior to the kernel summit
(hopefully we'll all be able to attend and decide).
--
Balbir
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-06 3:41 ` Balbir Singh
@ 2009-05-06 13:28 ` Vivek Goyal
[not found] ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, dhaval, snitzer, dm-devel, jens.axboe, agk,
paolo.valente, fernando, jmoyer, fchecconi, containers,
linux-kernel, righi.andrea
On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:
>
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > come back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
>
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).
Hi Balbir,
The mini summit is still a few months away. I think a better idea would be to
try to thrash out the details here on lkml and try to reach some
conclusion.
It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini summit will serve
that purpose.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
@ 2009-05-06 13:28 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
To: Balbir Singh
Cc: paolo.valente, dhaval, snitzer, fernando, jmoyer, linux-kernel,
fchecconi, dm-devel, jens.axboe, Andrew Morton, containers, agk,
righi.andrea
On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:
>
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > come back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
>
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).
Hi Balbir,
The mini summit is still a few months away. I think a better idea would be to
try to thrash out the details here on lkml and try to reach some
conclusion.
It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini summit will serve
that purpose.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>]
* Re: IO scheduler based IO Controller V2
[not found] ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 13:28 ` Vivek Goyal
0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
To: Balbir Singh
Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
agk-H+wXaHxf7aLQT0dZR+AlfA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> [2009-05-05 13:24:41]:
>
> > On Tue, 5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > >
> > > Hi All,
> > >
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > >
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> >
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> >
> > Seriously, how are we to resolve this? We could lock me in a room and
> > come back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
>
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).
Hi Balbir,
The mini summit is still a few months away. I think a better idea would be to
try to thrash out the details here on lkml and try to reach some
conclusion.
It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini summit will serve
that purpose.
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (18 preceding siblings ...)
2009-05-05 20:24 ` Andrew Morton
@ 2009-05-06 8:11 ` Gui Jianfeng
2009-05-08 9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
21 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-06 8:11 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> Hi All,
>
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> First version of the patches was posted here.
Hi Vivek,
I did some simple tests for V2 and triggered a kernel panic.
The following script can reproduce the bug. It seems that the cgroup
has already been removed, but the IO Controller still tries to access it.
#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
mkdir /cgroup 2> /dev/null
mount -t cgroup -o io,blkio io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2
echo 100 > /cgroup/test1/io.weight
echo 500 > /cgroup/test2/io.weight
./rwio -w -f 2000M.1 &    # do async write
pid1=$!
echo $pid1 > /cgroup/test1/tasks
./rwio -w -f 2000M.2 &
pid2=$!
echo $pid2 > /cgroup/test2/tasks
sleep 10
kill -9 $pid1
kill -9 $pid2
sleep 1
echo ======
cat /cgroup/test1/io.disk_time
cat /cgroup/test2/io.disk_time
echo ======
cat /cgroup/test1/io.disk_sectors
cat /cgroup/test2/io.disk_sectors
rmdir /cgroup/test1
rmdir /cgroup/test2
umount /cgroup
rmdir /cgroup
BUG: unable to handle kernel NULL pointer dereferec
IP: [<c0448c24>] cgroup_path+0xc/0x97
*pde = 64d2d067
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/md0/range
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
Pid: 132, comm: kblockd/0 Not tainted (2.6.30-rc4-Vivek-V2 #1) Veriton M460
EIP: 0060:[<c0448c24>] EFLAGS: 00010086 CPU: 0
EIP is at cgroup_path+0xc/0x97
EAX: 00000100 EBX: f60adca0 ECX: 00000080 EDX: f709fe28
ESI: f60adca8 EDI: f709fe28 EBP: 00000100 ESP: f709fdf0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kblockd/0 (pid: 132, ti=f709f000 task=f70a8f60 task.ti=f709f000)
Stack:
f709fe28 f68c5698 f60adca0 f60adca8 f709fe28 f68de801 c04f5389 00000080
f68de800 f7094d0c f6a29118 f68bde00 00000016 c04f5e8d c04f5340 00000080
c0579fec f68c5e94 00000082 c042edb4 f68c5fd4 f68c5fd4 c080b520 00000082
Call Trace:
[<c04f5389>] ? io_group_path+0x6d/0x89
[<c04f5e8d>] ? elv_ioq_served+0x2a/0x7a
[<c04f5340>] ? io_group_path+0x24/0x89
[<c0579fec>] ? ide_build_dmatable+0xda/0x130
[<c042edb4>] ? lock_timer_base+0x19/0x35
[<c042ef0c>] ? mod_timer+0x9f/0xa8
[<c04fdee6>] ? __delay+0x6/0x7
[<c057364f>] ? ide_execute_command+0x5d/0x71
[<c0579d4f>] ? ide_dma_intr+0x0/0x99
[<c0576496>] ? do_rw_taskfile+0x201/0x213
[<c04f6daa>] ? __elv_ioq_slice_expired+0x212/0x25e
[<c04f7e15>] ? elv_fq_select_ioq+0x121/0x184
[<c04e8a2f>] ? elv_select_sched_queue+0x1e/0x2e
[<c04f439c>] ? cfq_dispatch_requests+0xaa/0x238
[<c04e7e67>] ? elv_next_request+0x152/0x15f
[<c04240c2>] ? dequeue_task_fair+0x16/0x2d
[<c0572f49>] ? do_ide_request+0x10f/0x4c8
[<c0642d44>] ? __schedule+0x845/0x893
[<c042edb4>] ? lock_timer_base+0x19/0x35
[<c042f1be>] ? del_timer+0x41/0x47
[<c04ea5c6>] ? __generic_unplug_device+0x23/0x25
[<c04f530d>] ? elv_kick_queue+0x19/0x28
[<c0434b77>] ? worker_thread+0x11f/0x19e
[<c04f52f4>] ? elv_kick_queue+0x0/0x28
[<c0436ffc>] ? autoremove_wake_function+0x0/0x2d
[<c0434a58>] ? worker_thread+0x0/0x19e
[<c0436f3b>] ? kthread+0x42/0x67
[<c0436ef9>] ? kthread+0x0/0x67
[<c040326f>] ? kernel_thread_helper+0x7/0x10
Code: c0 84 c0 74 0e 89 d8 e8 7c e9 fd ff eb 05 bf fd ff ff ff e8 c0 ea ff ff 8
EIP: [<c0448c24>] cgroup_path+0xc/0x97 SS:ESP 0068:f709fdf0
CR2: 000000000000011c
---[ end trace 2d4bc25a2c33e394 ]---
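(Judging from the trace, io_group_path() ends up calling cgroup_path() for a
cgroup that has already gone away. Just to illustrate the kind of guard that
might avoid the NULL dereference -- css_lookup() and iog->iocg_id are taken
from the debug bits of the patchset, but this is only a rough sketch, not a
tested fix:)

static void io_group_path_guarded(struct io_group *iog, char *buf, int buflen)
{
	struct cgroup_subsys_state *css;

	buf[0] = '\0';

	rcu_read_lock();
	/* build the path only if the css id still resolves to a live cgroup */
	css = css_lookup(&io_subsys, iog->iocg_id);
	if (css)
		cgroup_path(css->cgroup, buf, buflen);
	rcu_read_unlock();
}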
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH] io-controller: Add io group reference handling for request
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (19 preceding siblings ...)
2009-05-06 8:11 ` Gui Jianfeng
@ 2009-05-08 9:45 ` Gui Jianfeng
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
21 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08 9:45 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
This patch adds io group reference handling when allocating
and removing a request.
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
elevator-fq.c | 15 ++++++++++++++-
elevator-fq.h | 5 +++++
elevator.c | 2 ++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..e6d6712 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
spin_unlock_irqrestore(q->queue_lock, flags);
BUG_ON(!iog);
- /* Store iog in rq. TODO: take care of referencing */
+ elv_get_iog(iog);
rq->iog = iog;
}
/*
+ * This request has been serviced. Clean up iog info and drop the reference.
+ */
+void elv_fq_unset_request_io_group(struct request *rq)
+{
+ struct io_group *iog = rq->iog;
+
+ if (iog) {
+ rq->iog = NULL;
+ elv_put_iog(iog);
+ }
+}
+
+/*
* Find/Create the io queue the rq should go in. This is an optimization
* for the io schedulers (noop, deadline and AS) which maintain only single
* io queue per cgroup. In this case common layer can just maintain a
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..96a28e9 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
extern int io_group_allow_merge(struct request *rq, struct bio *bio);
extern void elv_fq_set_request_io_group(struct request_queue *q,
struct request *rq, struct bio *bio);
+extern void elv_fq_unset_request_io_group(struct request *rq);
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
return iog->entity.weight;
@@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
{
}
+static inline void elv_fq_unset_request_io_group(struct request *rq)
+{
+}
+
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
/* Just root group is present and weight is immaterial. */
diff --git a/block/elevator.c b/block/elevator.c
index 44c9fad..d75eec7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_unset_request_io_group(rq);
+
/*
* Optimization for noop, deadline and AS which maintain only single
* ioq per io group
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (20 preceding siblings ...)
2009-05-08 9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
@ 2009-05-13 2:00 ` Gui Jianfeng
21 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13 2:00 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.
You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.
Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2
Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 11 +++
2 files changed, 245 insertions(+), 5 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..7c95d55 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+ void *key);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+ void *key)
{
struct io_entity *entity = &iog->entity;
+ struct policy_node *pn;
+
+ spin_lock_irq(&iocg->lock);
+ pn = policy_search_node(iocg, key);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->new_weight = pn->weight;
+ entity->ioprio_class = pn->ioprio_class;
+ entity->new_ioprio_class = pn->ioprio_class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->new_weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ entity->new_ioprio_class = iocg->ioprio_class;
+ }
+ spin_unlock_irq(&iocg->lock);
- entity->weight = entity->new_weight = iocg->weight;
- entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
entity->ioprio_changed = 1;
entity->my_sched_data = &iog->sched_data;
}
@@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
atomic_set(&iog->ref, 0);
iog->deleting = 0;
- io_group_init_entity(iocg, iog);
+ io_group_init_entity(iocg, iog, key);
iog->my_entity = &iog->entity;
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
@@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
return iog;
}
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct policy_node *pn;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+
+ if (list_empty(&iocg->list))
+ goto out;
+
+ seq_printf(m, "dev weight class\n");
+
+ spin_lock_irq(&iocg->lock);
+ list_for_each_entry(pn, &iocg->list, node) {
+ seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+ pn->weight, pn->ioprio_class);
+ }
+ spin_unlock_irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+ struct policy_node *pn)
+{
+ list_add(&pn->node, &iocg->list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct policy_node *pn)
+{
+ list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+ void *key)
+{
+ struct policy_node *pn;
+
+ if (list_empty(&iocg->list))
+ return NULL;
+
+ list_for_each_entry(pn, &iocg->list, node) {
+ if (pn->key == key)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static void *devname_to_efqd(const char *buf)
+{
+ struct block_device *bdev;
+ void *key = NULL;
+ struct gendisk *disk;
+ int part;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return NULL;
+
+ disk = get_gendisk(bdev->bd_dev, &part);
+ key = (void *)&disk->queue->elevator->efqd;
+ bdput(bdev);
+
+ return key;
+}
+
+static int policy_parse_and_set(char *buf, struct policy_node *newpn)
+{
+ char *s[3];
+ char *p;
+ int ret;
+ int i = 0;
+
+ memset(s, 0, sizeof(s));
+ while (i < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[i++] = p;
+ }
+
+ newpn->key = devname_to_efqd(s[0]);
+ if (!newpn->key)
+ return -EINVAL;
+
+ strcpy(newpn->dev_name, s[0]);
+
+ ret = strict_strtoul(s[1], 10, &newpn->weight);
+ if (ret || newpn->weight > WEIGHT_MAX)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+ if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+ newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io_cgroup *iocg;
+ struct policy_node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep_newpn = 0;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ buf = kstrdup(buffer, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free_buf;
+ }
+
+ ret = policy_parse_and_set(buf, newpn);
+ if (ret)
+ goto free_newpn;
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto free_newpn;
+ }
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ spin_lock_irq(&iocg->lock);
+
+ pn = policy_search_node(iocg, newpn->key);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy_insert_node(iocg, newpn);
+ keep_newpn = 1;
+ }
+ goto update_io_group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleting a policy */
+ policy_delete_node(pn);
+ goto update_io_group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+ if (iog->key == newpn->key) {
+ if (newpn->weight) {
+ iog->entity.new_weight = newpn->weight;
+ iog->entity.new_ioprio_class =
+ newpn->ioprio_class;
+ /*
+ * iog weight and ioprio_class updating
+ * actually happens if ioprio_changed is set.
+ * So ensure ioprio_changed is not set until
+ * new weight and new ioprio_class are updated.
+ */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ } else {
+ iog->entity.new_weight = iocg->weight;
+ iog->entity.new_ioprio_class =
+ iocg->ioprio_class;
+
+ /* The same as above */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ }
+ }
+ }
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+free_newpn:
+ if (!keep_newpn)
+ kfree(newpn);
+free_buf:
+ kfree(buf);
+ return ret;
+}
+
struct cftype bfqio_files[] = {
{
+ .name = "policy",
+ .read_seq_string = io_cgroup_policy_read,
+ .write_string = io_cgroup_policy_write,
+ .max_write_len = 256,
+ },
+ {
.name = "weight",
.read_u64 = io_cgroup_weight_read,
.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
INIT_HLIST_HEAD(&iocg->group_data);
iocg->weight = IO_DEFAULT_GRP_WEIGHT;
iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+ INIT_LIST_HEAD(&iocg->list);
return &iocg->css;
}
@@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
unsigned long flags, flags1;
int queue_lock_held = 0;
struct elv_fq_data *efqd;
+ struct policy_node *pn, *pntmp;
/*
* io groups are linked in two lists. One list is maintained
@@ -1823,6 +2046,12 @@ locked:
BUG_ON(!hlist_empty(&iocg->group_data));
free_css_id(&io_subsys, &iocg->css);
+
+ list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
+ policy_delete_node(pn);
+ kfree(pn);
+ }
+
kfree(iocg);
}
@@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
{
entity->ioprio = entity->new_ioprio;
- entity->weight = entity->new_weight;
+ entity->weight = entity->new_weight;
entity->ioprio_class = entity->new_ioprio_class;
entity->sched_data = &iog->sched_data;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..0407633 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -253,6 +253,14 @@ struct io_group {
#endif
};
+struct policy_node {
+ struct list_head node;
+ char dev_name[32];
+ void *key;
+ unsigned long weight;
+ unsigned long ioprio_class;
+};
+
/**
* struct bfqio_cgroup - bfq cgroup data structure.
* @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +277,9 @@ struct io_cgroup {
unsigned long weight, ioprio_class;
+ /* list of policy_node */
+ struct list_head list;
+
spinlock_t lock;
struct hlist_head group_data;
};
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: IO scheduler based IO Controller V2
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (34 preceding siblings ...)
[not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 8:11 ` Gui Jianfeng
[not found] ` <4A014619.1040000-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-08 9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
37 siblings, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-06 8:11 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> Hi All,
>
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> First version of the patches was posted here.
Hi Vivek,
I did some simple tests for V2 and triggered a kernel panic.
The following script can reproduce the bug. It seems that the cgroup
has already been removed, but the IO Controller still tries to access it.
#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
mkdir /cgroup 2> /dev/null
mount -t cgroup -o io,blkio io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2
echo 100 > /cgroup/test1/io.weight
echo 500 > /cgroup/test2/io.weight
./rwio -w -f 2000M.1 &    # do async write
pid1=$!
echo $pid1 > /cgroup/test1/tasks
./rwio -w -f 2000M.2 &
pid2=$!
echo $pid2 > /cgroup/test2/tasks
sleep 10
kill -9 $pid1
kill -9 $pid2
sleep 1
echo ======
cat /cgroup/test1/io.disk_time
cat /cgroup/test2/io.disk_time
echo ======
cat /cgroup/test1/io.disk_sectors
cat /cgroup/test2/io.disk_sectors
rmdir /cgroup/test1
rmdir /cgroup/test2
umount /cgroup
rmdir /cgroup
BUG: unable to handle kernel NULL pointer dereferec
IP: [<c0448c24>] cgroup_path+0xc/0x97
*pde = 64d2d067
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/md0/range
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
Pid: 132, comm: kblockd/0 Not tainted (2.6.30-rc4-Vivek-V2 #1) Veriton M460
EIP: 0060:[<c0448c24>] EFLAGS: 00010086 CPU: 0
EIP is at cgroup_path+0xc/0x97
EAX: 00000100 EBX: f60adca0 ECX: 00000080 EDX: f709fe28
ESI: f60adca8 EDI: f709fe28 EBP: 00000100 ESP: f709fdf0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kblockd/0 (pid: 132, ti=f709f000 task=f70a8f60 task.ti=f709f000)
Stack:
f709fe28 f68c5698 f60adca0 f60adca8 f709fe28 f68de801 c04f5389 00000080
f68de800 f7094d0c f6a29118 f68bde00 00000016 c04f5e8d c04f5340 00000080
c0579fec f68c5e94 00000082 c042edb4 f68c5fd4 f68c5fd4 c080b520 00000082
Call Trace:
[<c04f5389>] ? io_group_path+0x6d/0x89
[<c04f5e8d>] ? elv_ioq_served+0x2a/0x7a
[<c04f5340>] ? io_group_path+0x24/0x89
[<c0579fec>] ? ide_build_dmatable+0xda/0x130
[<c042edb4>] ? lock_timer_base+0x19/0x35
[<c042ef0c>] ? mod_timer+0x9f/0xa8
[<c04fdee6>] ? __delay+0x6/0x7
[<c057364f>] ? ide_execute_command+0x5d/0x71
[<c0579d4f>] ? ide_dma_intr+0x0/0x99
[<c0576496>] ? do_rw_taskfile+0x201/0x213
[<c04f6daa>] ? __elv_ioq_slice_expired+0x212/0x25e
[<c04f7e15>] ? elv_fq_select_ioq+0x121/0x184
[<c04e8a2f>] ? elv_select_sched_queue+0x1e/0x2e
[<c04f439c>] ? cfq_dispatch_requests+0xaa/0x238
[<c04e7e67>] ? elv_next_request+0x152/0x15f
[<c04240c2>] ? dequeue_task_fair+0x16/0x2d
[<c0572f49>] ? do_ide_request+0x10f/0x4c8
[<c0642d44>] ? __schedule+0x845/0x893
[<c042edb4>] ? lock_timer_base+0x19/0x35
[<c042f1be>] ? del_timer+0x41/0x47
[<c04ea5c6>] ? __generic_unplug_device+0x23/0x25
[<c04f530d>] ? elv_kick_queue+0x19/0x28
[<c0434b77>] ? worker_thread+0x11f/0x19e
[<c04f52f4>] ? elv_kick_queue+0x0/0x28
[<c0436ffc>] ? autoremove_wake_function+0x0/0x2d
[<c0434a58>] ? worker_thread+0x0/0x19e
[<c0436f3b>] ? kthread+0x42/0x67
[<c0436ef9>] ? kthread+0x0/0x67
[<c040326f>] ? kernel_thread_helper+0x7/0x10
Code: c0 84 c0 74 0e 89 d8 e8 7c e9 fd ff eb 05 bf fd ff ff ff e8 c0 ea ff ff 8
EIP: [<c0448c24>] cgroup_path+0xc/0x97 SS:ESP 0068:f709fdf0
CR2: 000000000000011c
---[ end trace 2d4bc25a2c33e394 ]---
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* [PATCH] io-controller: Add io group reference handling for request
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (35 preceding siblings ...)
2009-05-06 8:11 ` IO scheduler based IO Controller V2 Gui Jianfeng
@ 2009-05-08 9:45 ` Gui Jianfeng
[not found] ` <4A03FF3C.4020506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
37 siblings, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08 9:45 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Hi Vivek,
This patch adds io group reference handling when allocating
and removing a request.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
elevator-fq.c | 15 ++++++++++++++-
elevator-fq.h | 5 +++++
elevator.c | 2 ++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..e6d6712 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
spin_unlock_irqrestore(q->queue_lock, flags);
BUG_ON(!iog);
- /* Store iog in rq. TODO: take care of referencing */
+ elv_get_iog(iog);
rq->iog = iog;
}
/*
+ * This request has been serviced. Clean up iog info and drop the reference.
+ */
+void elv_fq_unset_request_io_group(struct request *rq)
+{
+ struct io_group *iog = rq->iog;
+
+ if (iog) {
+ rq->iog = NULL;
+ elv_put_iog(iog);
+ }
+}
+
+/*
* Find/Create the io queue the rq should go in. This is an optimization
* for the io schedulers (noop, deadline and AS) which maintain only single
* io queue per cgroup. In this case common layer can just maintain a
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..96a28e9 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
extern int io_group_allow_merge(struct request *rq, struct bio *bio);
extern void elv_fq_set_request_io_group(struct request_queue *q,
struct request *rq, struct bio *bio);
+extern void elv_fq_unset_request_io_group(struct request *rq);
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
return iog->entity.weight;
@@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
{
}
+static inline void elv_fq_unset_request_io_group(struct request *rq)
+{
+}
+
static inline bfq_weight_t iog_weight(struct io_group *iog)
{
/* Just root group is present and weight is immaterial. */
diff --git a/block/elevator.c b/block/elevator.c
index 44c9fad..d75eec7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ elv_fq_unset_request_io_group(rq);
+
/*
* Optimization for noop, deadline and AS which maintain only single
* ioq per io group
^ permalink raw reply related [flat|nested] 297+ messages in thread
* [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
` (36 preceding siblings ...)
2009-05-08 9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
@ 2009-05-13 2:00 ` Gui Jianfeng
2009-05-13 14:44 ` Vivek Goyal
` (5 more replies)
37 siblings, 6 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13 2:00 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Hi Vivek,
This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.
You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.
Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2
Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 11 +++
2 files changed, 245 insertions(+), 5 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..7c95d55 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+ void *key);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+ void *key)
{
struct io_entity *entity = &iog->entity;
+ struct policy_node *pn;
+
+ spin_lock_irq(&iocg->lock);
+ pn = policy_search_node(iocg, key);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->new_weight = pn->weight;
+ entity->ioprio_class = pn->ioprio_class;
+ entity->new_ioprio_class = pn->ioprio_class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->new_weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ entity->new_ioprio_class = iocg->ioprio_class;
+ }
+ spin_unlock_irq(&iocg->lock);
- entity->weight = entity->new_weight = iocg->weight;
- entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
entity->ioprio_changed = 1;
entity->my_sched_data = &iog->sched_data;
}
@@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
atomic_set(&iog->ref, 0);
iog->deleting = 0;
- io_group_init_entity(iocg, iog);
+ io_group_init_entity(iocg, iog, key);
iog->my_entity = &iog->entity;
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
@@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
return iog;
}
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct policy_node *pn;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+
+ if (list_empty(&iocg->list))
+ goto out;
+
+ seq_printf(m, "dev weight class\n");
+
+ spin_lock_irq(&iocg->lock);
+ list_for_each_entry(pn, &iocg->list, node) {
+ seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+ pn->weight, pn->ioprio_class);
+ }
+ spin_unlock_irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+ struct policy_node *pn)
+{
+ list_add(&pn->node, &iocg->list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct policy_node *pn)
+{
+ list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+ void *key)
+{
+ struct policy_node *pn;
+
+ if (list_empty(&iocg->list))
+ return NULL;
+
+ list_for_each_entry(pn, &iocg->list, node) {
+ if (pn->key == key)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static void *devname_to_efqd(const char *buf)
+{
+ struct block_device *bdev;
+ void *key = NULL;
+ struct gendisk *disk;
+ int part;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return NULL;
+
+ disk = get_gendisk(bdev->bd_dev, &part);
+ key = (void *)&disk->queue->elevator->efqd;
+ bdput(bdev);
+
+ return key;
+}
+
+static int policy_parse_and_set(char *buf, struct policy_node *newpn)
+{
+ char *s[3];
+ char *p;
+ int ret;
+ int i = 0;
+
+ memset(s, 0, sizeof(s));
+ while (i < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[i++] = p;
+ }
+
+ newpn->key = devname_to_efqd(s[0]);
+ if (!newpn->key)
+ return -EINVAL;
+
+ strcpy(newpn->dev_name, s[0]);
+
+ ret = strict_strtoul(s[1], 10, &newpn->weight);
+ if (ret || newpn->weight > WEIGHT_MAX)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+ if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+ newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io_cgroup *iocg;
+ struct policy_node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep_newpn = 0;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ buf = kstrdup(buffer, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free_buf;
+ }
+
+ ret = policy_parse_and_set(buf, newpn);
+ if (ret)
+ goto free_newpn;
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto free_newpn;
+ }
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ spin_lock_irq(&iocg->lock);
+
+ pn = policy_search_node(iocg, newpn->key);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy_insert_node(iocg, newpn);
+ keep_newpn = 1;
+ }
+ goto update_io_group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleting a policy */
+ policy_delete_node(pn);
+ goto update_io_group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+ if (iog->key == newpn->key) {
+ if (newpn->weight) {
+ iog->entity.new_weight = newpn->weight;
+ iog->entity.new_ioprio_class =
+ newpn->ioprio_class;
+ /*
+ * iog weight and ioprio_class updating
+ * actually happens if ioprio_changed is set.
+ * So ensure ioprio_changed is not set until
+ * new weight and new ioprio_class are updated.
+ */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ } else {
+ iog->entity.new_weight = iocg->weight;
+ iog->entity.new_ioprio_class =
+ iocg->ioprio_class;
+
+ /* The same as above */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ }
+ }
+ }
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+free_newpn:
+ if (!keep_newpn)
+ kfree(newpn);
+free_buf:
+ kfree(buf);
+ return ret;
+}
+
struct cftype bfqio_files[] = {
{
+ .name = "policy",
+ .read_seq_string = io_cgroup_policy_read,
+ .write_string = io_cgroup_policy_write,
+ .max_write_len = 256,
+ },
+ {
.name = "weight",
.read_u64 = io_cgroup_weight_read,
.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
INIT_HLIST_HEAD(&iocg->group_data);
iocg->weight = IO_DEFAULT_GRP_WEIGHT;
iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+ INIT_LIST_HEAD(&iocg->list);
return &iocg->css;
}
@@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
unsigned long flags, flags1;
int queue_lock_held = 0;
struct elv_fq_data *efqd;
+ struct policy_node *pn, *pntmp;
/*
* io groups are linked in two lists. One list is maintained
@@ -1823,6 +2046,12 @@ locked:
BUG_ON(!hlist_empty(&iocg->group_data));
free_css_id(&io_subsys, &iocg->css);
+
+ list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
+ policy_delete_node(pn);
+ kfree(pn);
+ }
+
kfree(iocg);
}
@@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
{
entity->ioprio = entity->new_ioprio;
- entity->weight = entity->new_weight;
> + entity->weight = entity->new_weight;
entity->ioprio_class = entity->new_ioprio_class;
entity->sched_data = &iog->sched_data;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..0407633 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -253,6 +253,14 @@ struct io_group {
#endif
};
+struct policy_node {
+ struct list_head node;
+ char dev_name[32];
+ void *key;
+ unsigned long weight;
+ unsigned long ioprio_class;
+};
+
/**
* struct bfqio_cgroup - bfq cgroup data structure.
* @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +277,9 @@ struct io_cgroup {
unsigned long weight, ioprio_class;
+ /* list of policy_node */
+ struct list_head list;
+
spinlock_t lock;
struct hlist_head group_data;
};
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
@ 2009-05-13 14:44 ` Vivek Goyal
[not found] ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 0:59 ` Gui Jianfeng
2009-05-13 15:29 ` Vivek Goyal
` (4 subsequent siblings)
5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:44 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
Thanks for the patch, Gui. I will test it out and let you know how
it goes.
Thanks
Vivek
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
> - entity->weight = entity->new_weight = iocg->weight;
> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
> entity->ioprio_changed = 1;
> entity->my_sched_data = &iog->sched_data;
> }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> atomic_set(&iog->ref, 0);
> iog->deleting = 0;
>
> - io_group_init_entity(iocg, iog);
> + io_group_init_entity(iocg, iog, key);
> iog->my_entity = &iog->entity;
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> return iog;
> }
>
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *pn;
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> +
> + if (list_empty(&iocg->list))
> + goto out;
> +
> + seq_printf(m, "dev weight class\n");
> +
> + spin_lock_irq(&iocg->lock);
> + list_for_each_entry(pn, &iocg->list, node) {
> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> + pn->weight, pn->ioprio_class);
> + }
> + spin_unlock_irq(&iocg->lock);
> +out:
> + return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> + struct policy_node *pn)
> +{
> + list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> + list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key)
> +{
> + struct policy_node *pn;
> +
> + if (list_empty(&iocg->list))
> + return NULL;
> +
> + list_for_each_entry(pn, &iocg->list, node) {
> + if (pn->key == key)
> + return pn;
> + }
> +
> + return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> + struct block_device *bdev;
> + void *key = NULL;
> + struct gendisk *disk;
> + int part;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return NULL;
> +
> + disk = get_gendisk(bdev->bd_dev, &part);
> + key = (void *)&disk->queue->elevator->efqd;
> + bdput(bdev);
> +
> + return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> + char *s[3];
> + char *p;
> + int ret;
> + int i = 0;
> +
> + memset(s, 0, sizeof(s));
> + while (i < ARRAY_SIZE(s)) {
> + p = strsep(&buf, ":");
> + if (!p)
> + break;
> + if (!*p)
> + continue;
> + s[i++] = p;
> + }
> +
> + newpn->key = devname_to_efqd(s[0]);
> + if (!newpn->key)
> + return -EINVAL;
> +
> + strcpy(newpn->dev_name, s[0]);
> +
> + ret = strict_strtoul(s[1], 10, &newpn->weight);
> + if (ret || newpn->weight > WEIGHT_MAX)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *newpn, *pn;
> + char *buf;
> + int ret = 0;
> + int keep_newpn = 0;
> + struct hlist_node *n;
> + struct io_group *iog;
> +
> + buf = kstrdup(buffer, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> + if (!newpn) {
> + ret = -ENOMEM;
> + goto free_buf;
> + }
> +
> + ret = policy_parse_and_set(buf, newpn);
> + if (ret)
> + goto free_newpn;
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto free_newpn;
> + }
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> + spin_lock_irq(&iocg->lock);
> +
> + pn = policy_search_node(iocg, newpn->key);
> + if (!pn) {
> + if (newpn->weight != 0) {
> + policy_insert_node(iocg, newpn);
> + keep_newpn = 1;
> + }
> + goto update_io_group;
> + }
> +
> + if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> + policy_delete_node(pn);
> + goto update_io_group;
> + }
> +
> + pn->weight = newpn->weight;
> + pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> + if (iog->key == newpn->key) {
> + if (newpn->weight) {
> + iog->entity.new_weight = newpn->weight;
> + iog->entity.new_ioprio_class =
> + newpn->ioprio_class;
> + /*
> + * iog weight and ioprio_class updating
> + * actually happens if ioprio_changed is set.
> + * So ensure ioprio_changed is not set until
> + * new weight and new ioprio_class are updated.
> + */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + } else {
> + iog->entity.new_weight = iocg->weight;
> + iog->entity.new_ioprio_class =
> + iocg->ioprio_class;
> +
> + /* The same as above */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + }
> + }
> + }
> + spin_unlock_irq(&iocg->lock);
> +
> + cgroup_unlock();
> +
> +free_newpn:
> + if (!keep_newpn)
> + kfree(newpn);
> +free_buf:
> + kfree(buf);
> + return ret;
> +}
> +
> struct cftype bfqio_files[] = {
> {
> + .name = "policy",
> + .read_seq_string = io_cgroup_policy_read,
> + .write_string = io_cgroup_policy_write,
> + .max_write_len = 256,
> + },
> + {
> .name = "weight",
> .read_u64 = io_cgroup_weight_read,
> .write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
> INIT_HLIST_HEAD(&iocg->group_data);
> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> + INIT_LIST_HEAD(&iocg->list);
>
> return &iocg->css;
> }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> unsigned long flags, flags1;
> int queue_lock_held = 0;
> struct elv_fq_data *efqd;
> + struct policy_node *pn, *pntmp;
>
> /*
> * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
> BUG_ON(!hlist_empty(&iocg->group_data));
>
> free_css_id(&io_subsys, &iocg->css);
> +
> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> + policy_delete_node(pn);
> + kfree(pn);
> + }
> +
> kfree(iocg);
> }
>
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
> spinlock_t lock;
> struct hlist_head group_data;
> };
> --
> 1.5.4.rc3
>
>
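The smp_wmb() added in the hunk above only orders the stores on the update
side; it has to pair with a read barrier wherever ioprio_changed is consumed
and the new values are copied over. A minimal sketch of that consumer side,
assuming it mirrors what bfq_init_entity() does (illustrative only; the
actual elevator-fq update path may differ):

	/* consumer side: pairs with the writer's smp_wmb() above */
	if (entity->ioprio_changed) {
		smp_rmb();
		entity->weight = entity->new_weight;
		entity->ioprio_class = entity->new_ioprio_class;
		entity->ioprio_changed = 0;
	}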
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 0:59 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 0:59 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>
> Thanks for the patch Gui. I will test it out and let you know how
> it goes.
Hi Vivek,
I forgot to mention that this patch isn't thoroughly tested yet; it is just
to show the design. I'd like to test it soon.
>
> Thanks
> Vivek
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 14:44 ` Vivek Goyal
[not found] ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 0:59 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 0:59 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>
> Thanks for the patch Gui. I will test it out and let you know how
> it goes.
Hi Vivek,
I forgot to mention that this patch isn't thoroughly tested yet; it is just
to show the design. I'd like to test it soon.
>
> Thanks
> Vivek
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
2009-05-13 14:44 ` Vivek Goyal
@ 2009-05-13 15:29 ` Vivek Goyal
2009-05-14 1:02 ` Gui Jianfeng
[not found] ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-13 15:59 ` Vivek Goyal
` (3 subsequent siblings)
5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:29 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
[..]
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
Would "io_policy_node" be better?
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
How about "struct list_head policy_list" or "struct list_head io_policy"?
Thanks
Vivek
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 15:29 ` Vivek Goyal
@ 2009-05-14 1:02 ` Gui Jianfeng
[not found] ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:02 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>
> [..]
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>> {
>> entity->ioprio = entity->new_ioprio;
>> - entity->weight = entity->new_weight;
>> + entity->weight = entity->new_weigh;
>> entity->ioprio_class = entity->new_ioprio_class;
>> entity->sched_data = &iog->sched_data;
>> }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>> #endif
>> };
>>
>> +struct policy_node {
>
> Would "io_policy_node" be better?
Sure
>
>> + struct list_head node;
>> + char dev_name[32];
>> + void *key;
>> + unsigned long weight;
>> + unsigned long ioprio_class;
>> +};
>> +
>> /**
>> * struct bfqio_cgroup - bfq cgroup data structure.
>> * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>
>> unsigned long weight, ioprio_class;
>>
>> + /* list of policy_node */
>> + struct list_head list;
>> +
>
> How about "struct list_head policy_list" or "struct list_head io_policy"?
OK
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 1:02 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:02 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>
> [..]
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>> {
>> entity->ioprio = entity->new_ioprio;
>> - entity->weight = entity->new_weight;
>> + entity->weight = entity->new_weigh;
>> entity->ioprio_class = entity->new_ioprio_class;
>> entity->sched_data = &iog->sched_data;
>> }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>> #endif
>> };
>>
>> +struct policy_node {
>
> Would "io_policy_node" be better?
Sure
>
>> + struct list_head node;
>> + char dev_name[32];
>> + void *key;
>> + unsigned long weight;
>> + unsigned long ioprio_class;
>> +};
>> +
>> /**
>> * struct bfqio_cgroup - bfq cgroup data structure.
>> * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>
>> unsigned long weight, ioprio_class;
>>
>> + /* list of policy_node */
>> + struct list_head list;
>> +
>
> How about "struct list_head policy_list" or "struct list_head io_policy"?
OK
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
2009-05-13 14:44 ` Vivek Goyal
2009-05-13 15:29 ` Vivek Goyal
@ 2009-05-13 15:59 ` Vivek Goyal
2009-05-14 1:51 ` Gui Jianfeng
` (2 more replies)
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (2 subsequent siblings)
5 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:59 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
Hi Gui,
It might make sense to also store the device name or device major and
minor number in io_group while creating the io group. This will help us
to display io.disk_time and io.disk_sector statistics per device instead
of aggregate.
I am attaching a patch I was playing around with to display per-device
statistics instead of an aggregate one, which is useful when the user has
specified a per-device rule.
Thanks
Vivek
o Currently the statistics exported through cgroup are aggregate of statistics
on all devices for that cgroup. Instead of aggregate, make these per device.
o Also export another statistic, io.disk_dequeue. This keeps a count of how
many times a particular group dropped out of the race for the disk. This is
a debugging aid to keep track of how often we could create continuously
backlogged queues.
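For illustration only (the device numbers and values below are hypothetical),
reading the per-device files would then produce one line per disk rather than
a single aggregate value:

	# cat io.disk_time
	8 0 2340
	8 16 1120
	# cat io.disk_dequeue
	8 0 57
	8 16 3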
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
block/elevator-fq.c | 127 +++++++++++++++++++++++++++++++++-------------------
block/elevator-fq.h | 3 +
2 files changed, 85 insertions(+), 45 deletions(-)
Index: linux14/block/elevator-fq.h
===================================================================
--- linux14.orig/block/elevator-fq.h 2009-05-13 11:40:32.000000000 -0400
+++ linux14/block/elevator-fq.h 2009-05-13 11:40:57.000000000 -0400
@@ -250,6 +250,9 @@ struct io_group {
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
unsigned short iocg_id;
+ dev_t dev;
+ /* How many times this group has been removed from active tree */
+ unsigned long dequeue;
#endif
};
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c 2009-05-13 11:40:53.000000000 -0400
+++ linux14/block/elevator-fq.c 2009-05-13 11:40:57.000000000 -0400
@@ -12,6 +12,7 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
BUG_ON(sd->active_entity == entity);
BUG_ON(sd->next_active == entity);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ struct io_group *iog = io_entity_to_iog(entity);
+ /*
+ * Keep track of how many times a group has been removed
+ * from active tree because it did not have any active
+ * backlogged ioq under it
+ */
+ if (iog)
+ iog->dequeue++;
+ }
+#endif
return ret;
}
@@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_time = 0;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
rcu_read_lock();
+ spin_lock_irq(&iocg->lock);
hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
/*
* There might be groups which are not functional and
* waiting to be reclaimed upon cgoup deletion.
*/
- if (rcu_dereference(iog->key))
- disk_time += iog->entity.total_service;
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_service);
+ }
}
+ spin_unlock_irq(&iocg->lock);
rcu_read_unlock();
- return disk_time;
+ cgroup_unlock();
+
+ return 0;
}
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
- struct cftype *cftype)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
struct io_cgroup *iocg;
- u64 ret;
+ struct io_group *iog;
+ struct hlist_node *n;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
- spin_lock_irq(&iocg->lock);
- ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
- spin_unlock_irq(&iocg->lock);
-
- cgroup_unlock();
-
- return ret;
-}
-
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
-{
- struct io_group *iog;
- struct hlist_node *n;
- u64 disk_sectors = 0;
rcu_read_lock();
+ spin_lock_irq(&iocg->lock);
hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
/*
* There might be groups which are not functional and
* waiting to be reclaimed upon cgoup deletion.
*/
- if (rcu_dereference(iog->key))
- disk_sectors += iog->entity.total_sector_service;
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_sector_service);
+ }
}
+ spin_unlock_irq(&iocg->lock);
rcu_read_unlock();
- return disk_sectors;
+ cgroup_unlock();
+
+ return 0;
}
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
- struct cftype *cftype)
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
- struct io_cgroup *iocg;
- u64 ret;
+ struct io_cgroup *iocg = NULL;
+ struct io_group *iog = NULL;
+ struct hlist_node *n;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
+
+ rcu_read_lock();
spin_lock_irq(&iocg->lock);
- ret = calculate_aggr_disk_sectors(iocg);
+ /* Loop through all the io groups and print statistics */
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgoup deletion.
+ */
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev), iog->dequeue);
+ }
+ }
spin_unlock_irq(&iocg->lock);
+ rcu_read_unlock();
cgroup_unlock();
- return ret;
+ return 0;
}
/**
@@ -1222,7 +1248,7 @@ static u64 io_cgroup_disk_sectors_read(s
* to the root has already an allocated group on @bfqd.
*/
struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
- struct cgroup *cgroup)
+ struct cgroup *cgroup, struct bio *bio)
{
struct io_cgroup *iocg;
struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1250,8 +1276,13 @@ struct io_group *io_group_chain_alloc(st
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
+ if (bio) {
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ iog->dev = MKDEV(disk->major, disk->first_minor);
+ }
#endif
blk_init_request_list(&iog->rl);
@@ -1364,7 +1395,7 @@ void io_group_chain_link(struct request_
*/
struct io_group *io_find_alloc_group(struct request_queue *q,
struct cgroup *cgroup, struct elv_fq_data *efqd,
- int create)
+ int create, struct bio *bio)
{
struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
struct io_group *iog = NULL;
@@ -1375,7 +1406,7 @@ struct io_group *io_find_alloc_group(str
if (iog != NULL || !create)
return iog;
- iog = io_group_chain_alloc(q, key, cgroup);
+ iog = io_group_chain_alloc(q, key, cgroup, bio);
if (iog != NULL)
io_group_chain_link(q, key, cgroup, iog, efqd);
@@ -1481,7 +1512,7 @@ struct io_group *io_get_io_group(struct
goto out;
}
- iog = io_find_alloc_group(q, cgroup, efqd, create);
+ iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
if (!iog) {
if (create)
iog = efqd->root_group;
@@ -1554,12 +1585,18 @@ struct cftype bfqio_files[] = {
},
{
.name = "disk_time",
- .read_u64 = io_cgroup_disk_time_read,
+ .read_seq_string = io_cgroup_disk_time_read,
},
{
.name = "disk_sectors",
- .read_u64 = io_cgroup_disk_sectors_read,
+ .read_seq_string = io_cgroup_disk_sectors_read,
},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ .name = "disk_dequeue",
+ .read_seq_string = io_cgroup_disk_dequeue_read,
+ },
+#endif
};
int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 15:59 ` Vivek Goyal
@ 2009-05-14 1:51 ` Gui Jianfeng
[not found] ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 2:25 ` Gui Jianfeng
2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:51 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
>
> Hi Gui,
>
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
>
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, which is useful when the user has
> specified a per-device rule.
>
> Thanks
> Vivek
>
>
> o Currently the statistics exported through cgroup are aggregate of statistics
> on all devices for that cgroup. Instead of aggregate, make these per device.
Hi Vivek,
Actually, I did it also.
FYI
Examples:
# cat io.disk_time
dev:/dev/hdb time:4421
dev:others time:3741
# cat io.disk_sectors
dev:/dev/hdb sectors:585696
dev:others sectors:2664
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/elevator-fq.c | 104 +++++++++++++++++++++++---------------------------
1 files changed, 48 insertions(+), 56 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c95d55..1620074 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1162,90 +1162,82 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype,
+ struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_time = 0;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
- /*
- * There might be groups which are not functional and
- * waiting to be reclaimed upon cgoup deletion.
- */
- if (rcu_dereference(iog->key))
- disk_time += iog->entity.total_service;
- }
- rcu_read_unlock();
-
- return disk_time;
-}
+ struct policy_node *pn;
+ unsigned int other, time;
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
- struct cftype *cftype)
-{
- struct io_cgroup *iocg;
- u64 ret;
+ other = 0;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
spin_lock_irq(&iocg->lock);
- ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ if (iog->key != NULL) {
+ pn = policy_search_node(iocg, iog->key);
+ if (pn) {
+ time = jiffies_to_msecs(iog->entity.
+ total_service);
+ seq_printf(m, "dev:%s time:%u\n",
+ pn->dev_name, time);
+ } else {
+ other += jiffies_to_msecs(iog->entity.
+ total_service);
+ }
+ }
+ }
+ seq_printf(m, "dev:others time:%u\n", other);
+
spin_unlock_irq(&iocg->lock);
cgroup_unlock();
- return ret;
+ return 0;
}
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype,
+ struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_sectors = 0;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
- /*
- * There might be groups which are not functional and
- * waiting to be reclaimed upon cgoup deletion.
- */
- if (rcu_dereference(iog->key))
- disk_sectors += iog->entity.total_sector_service;
- }
- rcu_read_unlock();
+ struct policy_node *pn;
+ u64 other = 0;
- return disk_sectors;
-}
-
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
- struct cftype *cftype)
-{
- struct io_cgroup *iocg;
- u64 ret;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
spin_lock_irq(&iocg->lock);
- ret = calculate_aggr_disk_sectors(iocg);
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ if (iog->key) {
+ pn = policy_search_node(iocg, iog->key);
+ if (pn) {
+ seq_printf(m, "dev:%s sectors:%lu\n",
+ pn->dev_name,
+ iog->entity.total_sector_service);
+ } else {
+ other += iog->entity.total_sector_service;
+ }
+ }
+ }
+
+ seq_printf(m, "dev:others sectors:%llu\n", other);
+
spin_unlock_irq(&iocg->lock);
cgroup_unlock();
- return ret;
+ return 0;
}
/**
@@ -1783,11 +1775,11 @@ struct cftype bfqio_files[] = {
},
{
.name = "disk_time",
- .read_u64 = io_cgroup_disk_time_read,
+ .read_seq_string = io_cgroup_disk_time_read,
},
{
.name = "disk_sectors",
- .read_u64 = io_cgroup_disk_sectors_read,
+ .read_seq_string = io_cgroup_disk_sectors_read,
},
};
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
[parent not found: <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 1:51 ` Gui Jianfeng
2009-05-14 2:25 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:51 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
...
>
> Hi Gui,
>
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
>
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, which is useful when the user has
> specified a per-device rule.
>
> Thanks
> Vivek
>
>
> o Currently the statistics exported through cgroup are aggregate of statistics
> on all devices for that cgroup. Instead of aggregate, make these per device.
Hi Vivek,
Actually, I did it also.
FYI
Examples:
# cat io.disk_time
dev:/dev/hdb time:4421
dev:others time:3741
# cat io.disk_sectors
dev:/dev/hdb sectors:585696
dev:others sectors:2664
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/elevator-fq.c | 104 +++++++++++++++++++++++---------------------------
1 files changed, 48 insertions(+), 56 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c95d55..1620074 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1162,90 +1162,82 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype,
+ struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_time = 0;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
- /*
- * There might be groups which are not functional and
- * waiting to be reclaimed upon cgoup deletion.
- */
- if (rcu_dereference(iog->key))
- disk_time += iog->entity.total_service;
- }
- rcu_read_unlock();
-
- return disk_time;
-}
+ struct policy_node *pn;
+ unsigned int other, time;
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
- struct cftype *cftype)
-{
- struct io_cgroup *iocg;
- u64 ret;
+ other = 0;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
spin_lock_irq(&iocg->lock);
- ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ if (iog->key != NULL) {
+ pn = policy_search_node(iocg, iog->key);
+ if (pn) {
+ time = jiffies_to_msecs(iog->entity.
+ total_service);
+ seq_printf(m, "dev:%s time:%u\n",
+ pn->dev_name, time);
+ } else {
+ other += jiffies_to_msecs(iog->entity.
+ total_service);
+ }
+ }
+ }
+ seq_printf(m, "dev:others time:%u\n", other);
+
spin_unlock_irq(&iocg->lock);
cgroup_unlock();
- return ret;
+ return 0;
}
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype,
+ struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_sectors = 0;
-
- rcu_read_lock();
- hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
- /*
- * There might be groups which are not functional and
- * waiting to be reclaimed upon cgoup deletion.
- */
- if (rcu_dereference(iog->key))
- disk_sectors += iog->entity.total_sector_service;
- }
- rcu_read_unlock();
+ struct policy_node *pn;
+ u64 other = 0;
- return disk_sectors;
-}
-
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
- struct cftype *cftype)
-{
- struct io_cgroup *iocg;
- u64 ret;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
spin_lock_irq(&iocg->lock);
- ret = calculate_aggr_disk_sectors(iocg);
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ if (iog->key) {
+ pn = policy_search_node(iocg, iog->key);
+ if (pn) {
+ seq_printf(m, "dev:%s sectors:%lu\n",
+ pn->dev_name,
+ iog->entity.total_sector_service);
+ } else {
+ other += iog->entity.total_sector_service;
+ }
+ }
+ }
+
+ seq_printf(m, "dev:others sectors:%llu\n", other);
+
spin_unlock_irq(&iocg->lock);
cgroup_unlock();
- return ret;
+ return 0;
}
/**
@@ -1783,11 +1775,11 @@ struct cftype bfqio_files[] = {
},
{
.name = "disk_time",
- .read_u64 = io_cgroup_disk_time_read,
+ .read_seq_string = io_cgroup_disk_time_read,
},
{
.name = "disk_sectors",
- .read_u64 = io_cgroup_disk_sectors_read,
+ .read_seq_string = io_cgroup_disk_sectors_read,
},
};
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 1:51 ` Gui Jianfeng
@ 2009-05-14 2:25 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 2:25 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
...
> Hi Gui,
>
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
>
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, which is useful when the user has
> specified a per-device rule.
>
> Thanks
> Vivek
>
>
> o Currently the statistics exported through cgroup are aggregate of statistics
> on all devices for that cgroup. Instead of aggregate, make these per device.
>
> o Also export another statistic, io.disk_dequeue. This keeps a count of how
> many times a particular group dropped out of the race for the disk. This is
> a debugging aid to keep track of how often we could create continuously
> backlogged queues.
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
> block/elevator-fq.c | 127 +++++++++++++++++++++++++++++++++-------------------
> block/elevator-fq.h | 3 +
> 2 files changed, 85 insertions(+), 45 deletions(-)
>
> Index: linux14/block/elevator-fq.h
> ===================================================================
> --- linux14.orig/block/elevator-fq.h 2009-05-13 11:40:32.000000000 -0400
> +++ linux14/block/elevator-fq.h 2009-05-13 11:40:57.000000000 -0400
> @@ -250,6 +250,9 @@ struct io_group {
>
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> unsigned short iocg_id;
> + dev_t dev;
> + /* How many times this group has been removed from active tree */
> + unsigned long dequeue;
> #endif
> };
>
> Index: linux14/block/elevator-fq.c
> ===================================================================
> --- linux14.orig/block/elevator-fq.c 2009-05-13 11:40:53.000000000 -0400
> +++ linux14/block/elevator-fq.c 2009-05-13 11:40:57.000000000 -0400
> @@ -12,6 +12,7 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
> BUG_ON(sd->active_entity == entity);
> BUG_ON(sd->next_active == entity);
>
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> + {
> + struct io_group *iog = io_entity_to_iog(entity);
> + /*
> + * Keep track of how many times a group has been removed
> + * from active tree because it did not have any active
> + * backlogged ioq under it
> + */
> + if (iog)
> + iog->dequeue++;
> + }
> +#endif
> return ret;
> }
>
> @@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
> #undef STORE_FUNCTION
>
> -/*
> - * traverse through all the io_groups associated with this cgroup and calculate
> - * the aggr disk time received by all the groups on respective disks.
> - */
> -static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +static int io_cgroup_disk_time_read(struct cgroup *cgroup,
> + struct cftype *cftype, struct seq_file *m)
> {
> + struct io_cgroup *iocg;
> struct io_group *iog;
> struct hlist_node *n;
> - u64 disk_time = 0;
> +
> + if (!cgroup_lock_live_group(cgroup))
> + return -ENODEV;
> +
> + iocg = cgroup_to_io_cgroup(cgroup);
>
> rcu_read_lock();
> + spin_lock_irq(&iocg->lock);
> hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> /*
> * There might be groups which are not functional and
> * waiting to be reclaimed upon cgoup deletion.
> */
> - if (rcu_dereference(iog->key))
> - disk_time += iog->entity.total_service;
> + if (rcu_dereference(iog->key)) {
> + seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> + MINOR(iog->dev),
> + iog->entity.total_service);
Hi Vivek,
I think it's easier for users if the device name is also shown here.
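A minimal sketch of that suggestion, assuming io_group also remembers its
device name at creation time (a hypothetical iog->dev_name field, filled in
the same way policy_node stores dev_name):

	seq_printf(m, "%s %u %u %lu\n", iog->dev_name,
			MAJOR(iog->dev), MINOR(iog->dev),
			iog->entity.total_service);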
> + }
> }
> + spin_unlock_irq(&iocg->lock);
> rcu_read_unlock();
>
> - return disk_time;
> + cgroup_unlock();
> +
> + return 0;
> }
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 15:59 ` Vivek Goyal
2009-05-14 1:51 ` Gui Jianfeng
[not found] ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 2:25 ` Gui Jianfeng
2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 2:25 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
> Hi Gui,
>
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
>
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, which is useful when the user has
> specified a per-device rule.
>
> Thanks
> Vivek
>
>
> o Currently the statistics exported through cgroup are aggregate of statistics
> on all devices for that cgroup. Instead of aggregate, make these per device.
>
> o Also export another statistic, io.disk_dequeue. This keeps a count of how
> many times a particular group dropped out of the race for the disk. This is
> a debugging aid to keep track of how often we could create continuously
> backlogged queues.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
> block/elevator-fq.c | 127 +++++++++++++++++++++++++++++++++-------------------
> block/elevator-fq.h | 3 +
> 2 files changed, 85 insertions(+), 45 deletions(-)
>
> Index: linux14/block/elevator-fq.h
> ===================================================================
> --- linux14.orig/block/elevator-fq.h 2009-05-13 11:40:32.000000000 -0400
> +++ linux14/block/elevator-fq.h 2009-05-13 11:40:57.000000000 -0400
> @@ -250,6 +250,9 @@ struct io_group {
>
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> unsigned short iocg_id;
> + dev_t dev;
> + /* How many times this group has been removed from active tree */
> + unsigned long dequeue;
> #endif
> };
>
> Index: linux14/block/elevator-fq.c
> ===================================================================
> --- linux14.orig/block/elevator-fq.c 2009-05-13 11:40:53.000000000 -0400
> +++ linux14/block/elevator-fq.c 2009-05-13 11:40:57.000000000 -0400
> @@ -12,6 +12,7 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
> BUG_ON(sd->active_entity == entity);
> BUG_ON(sd->next_active == entity);
>
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> + {
> + struct io_group *iog = io_entity_to_iog(entity);
> + /*
> + * Keep track of how many times a group has been removed
> + * from active tree because it did not have any active
> + * backlogged ioq under it
> + */
> + if (iog)
> + iog->dequeue++;
> + }
> +#endif
> return ret;
> }
>
> @@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
> #undef STORE_FUNCTION
>
> -/*
> - * traverse through all the io_groups associated with this cgroup and calculate
> - * the aggr disk time received by all the groups on respective disks.
> - */
> -static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +static int io_cgroup_disk_time_read(struct cgroup *cgroup,
> + struct cftype *cftype, struct seq_file *m)
> {
> + struct io_cgroup *iocg;
> struct io_group *iog;
> struct hlist_node *n;
> - u64 disk_time = 0;
> +
> + if (!cgroup_lock_live_group(cgroup))
> + return -ENODEV;
> +
> + iocg = cgroup_to_io_cgroup(cgroup);
>
> rcu_read_lock();
> + spin_lock_irq(&iocg->lock);
> hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> /*
> * There might be groups which are not functional and
> * waiting to be reclaimed upon cgoup deletion.
> */
> - if (rcu_dereference(iog->key))
> - disk_time += iog->entity.total_service;
> + if (rcu_dereference(iog->key)) {
> + seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> + MINOR(iog->dev),
> + iog->entity.total_service);
Hi Vivek,
I think it's easier for users if the device name is also shown here.
> + }
> }
> + spin_unlock_irq(&iocg->lock);
> rcu_read_unlock();
>
> - return disk_time;
> + cgroup_unlock();
> +
> + return 0;
> }
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:44 ` Vivek Goyal
2009-05-13 15:29 ` Vivek Goyal
` (3 subsequent siblings)
4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:44 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
Thanks for the patch Gui. I will test it out and let you know how
it goes.
Thanks
Vivek
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
> - entity->weight = entity->new_weight = iocg->weight;
> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
> entity->ioprio_changed = 1;
> entity->my_sched_data = &iog->sched_data;
> }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> atomic_set(&iog->ref, 0);
> iog->deleting = 0;
>
> - io_group_init_entity(iocg, iog);
> + io_group_init_entity(iocg, iog, key);
> iog->my_entity = &iog->entity;
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> return iog;
> }
>
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *pn;
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> +
> + if (list_empty(&iocg->list))
> + goto out;
> +
> + seq_printf(m, "dev weight class\n");
> +
> + spin_lock_irq(&iocg->lock);
> + list_for_each_entry(pn, &iocg->list, node) {
> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> + pn->weight, pn->ioprio_class);
> + }
> + spin_unlock_irq(&iocg->lock);
> +out:
> + return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> + struct policy_node *pn)
> +{
> + list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> + list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key)
> +{
> + struct policy_node *pn;
> +
> + if (list_empty(&iocg->list))
> + return NULL;
> +
> + list_for_each_entry(pn, &iocg->list, node) {
> + if (pn->key == key)
> + return pn;
> + }
> +
> + return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> + struct block_device *bdev;
> + void *key = NULL;
> + struct gendisk *disk;
> + int part;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return NULL;
> +
> + disk = get_gendisk(bdev->bd_dev, &part);
> + key = (void *)&disk->queue->elevator->efqd;
> + bdput(bdev);
> +
> + return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> + char *s[3];
> + char *p;
> + int ret;
> + int i = 0;
> +
> + memset(s, 0, sizeof(s));
> + while (i < ARRAY_SIZE(s)) {
> + p = strsep(&buf, ":");
> + if (!p)
> + break;
> + if (!*p)
> + continue;
> + s[i++] = p;
> + }
> +
> + newpn->key = devname_to_efqd(s[0]);
> + if (!newpn->key)
> + return -EINVAL;
> +
> + strcpy(newpn->dev_name, s[0]);
> +
> + ret = strict_strtoul(s[1], 10, &newpn->weight);
> + if (ret || newpn->weight > WEIGHT_MAX)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *newpn, *pn;
> + char *buf;
> + int ret = 0;
> + int keep_newpn = 0;
> + struct hlist_node *n;
> + struct io_group *iog;
> +
> + buf = kstrdup(buffer, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> + if (!newpn) {
> + ret = -ENOMEM;
> + goto free_buf;
> + }
> +
> + ret = policy_parse_and_set(buf, newpn);
> + if (ret)
> + goto free_newpn;
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto free_newpn;
> + }
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> + spin_lock_irq(&iocg->lock);
> +
> + pn = policy_search_node(iocg, newpn->key);
> + if (!pn) {
> + if (newpn->weight != 0) {
> + policy_insert_node(iocg, newpn);
> + keep_newpn = 1;
> + }
> + goto update_io_group;
> + }
> +
> + if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> + policy_delete_node(pn);
> + goto update_io_group;
> + }
> +
> + pn->weight = newpn->weight;
> + pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> + if (iog->key == newpn->key) {
> + if (newpn->weight) {
> + iog->entity.new_weight = newpn->weight;
> + iog->entity.new_ioprio_class =
> + newpn->ioprio_class;
> + /*
> + * iog weight and ioprio_class updating
> + * actually happens if ioprio_changed is set.
> + * So ensure ioprio_changed is not set until
> + * new weight and new ioprio_class are updated.
> + */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + } else {
> + iog->entity.new_weight = iocg->weight;
> + iog->entity.new_ioprio_class =
> + iocg->ioprio_class;
> +
> + /* The same as above */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + }
> + }
> + }
> + spin_unlock_irq(&iocg->lock);
> +
> + cgroup_unlock();
> +
> +free_newpn:
> + if (!keep_newpn)
> + kfree(newpn);
> +free_buf:
> + kfree(buf);
> + return ret;
> +}
> +
> struct cftype bfqio_files[] = {
> {
> + .name = "policy",
> + .read_seq_string = io_cgroup_policy_read,
> + .write_string = io_cgroup_policy_write,
> + .max_write_len = 256,
> + },
> + {
> .name = "weight",
> .read_u64 = io_cgroup_weight_read,
> .write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
> INIT_HLIST_HEAD(&iocg->group_data);
> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> + INIT_LIST_HEAD(&iocg->list);
>
> return &iocg->css;
> }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> unsigned long flags, flags1;
> int queue_lock_held = 0;
> struct elv_fq_data *efqd;
> + struct policy_node *pn, *pntmp;
>
> /*
> * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
> BUG_ON(!hlist_empty(&iocg->group_data));
>
> free_css_id(&io_subsys, &iocg->css);
> +
> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> + policy_delete_node(pn);
> + kfree(pn);
> + }
> +
> kfree(iocg);
> }
>
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
> spinlock_t lock;
> struct hlist_head group_data;
> };
> --
> 1.5.4.rc3
>
>
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 14:44 ` Vivek Goyal
@ 2009-05-13 15:29 ` Vivek Goyal
2009-05-13 15:59 ` Vivek Goyal
` (2 subsequent siblings)
4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:29 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
[..]
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
Would "io_policy_node" be better?
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
How about "struct list_head policy_list" or "struct list_head io_policy"?
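Just to make the suggestion concrete, a rough sketch of what I have in mind
(the names here are only the suggestions above, not something that exists in
the patch yet):

struct io_policy_node {
	struct list_head node;		/* link on io_cgroup->policy_list */
	char dev_name[32];
	void *key;
	unsigned long weight;
	unsigned long ioprio_class;
};

and in struct io_cgroup:

	/* list of io_policy_node */
	struct list_head policy_list;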
Thanks
Vivek
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 14:44 ` Vivek Goyal
2009-05-13 15:29 ` Vivek Goyal
@ 2009-05-13 15:59 ` Vivek Goyal
2009-05-13 17:17 ` Vivek Goyal
2009-05-13 19:09 ` Vivek Goyal
4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:59 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
Hi Gui,
It might make sense to also store the device name or the device major and
minor number in io_group while creating the io group. This will help us
display the io.disk_time and io.disk_sectors statistics per device instead
of as an aggregate.
I am attaching a patch I was playing around with to display per-device
statistics instead of aggregate ones, so that the output lines up with any
per-device rules the user has specified.
Thanks
Vivek
o Currently the statistics exported through the cgroup files are an aggregate
of the statistics on all devices for that cgroup. Instead of an aggregate,
make these per device.
o Also export another statistic, io.disk_dequeue. This keeps a count of how
many times a particular group was removed from the active tree of a disk,
i.e. dropped out of the race for that disk. This is a debugging aid for
keeping track of how often we manage to keep a group continuously backlogged.
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
block/elevator-fq.c | 127 +++++++++++++++++++++++++++++++++-------------------
block/elevator-fq.h | 3 +
2 files changed, 85 insertions(+), 45 deletions(-)
Index: linux14/block/elevator-fq.h
===================================================================
--- linux14.orig/block/elevator-fq.h 2009-05-13 11:40:32.000000000 -0400
+++ linux14/block/elevator-fq.h 2009-05-13 11:40:57.000000000 -0400
@@ -250,6 +250,9 @@ struct io_group {
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
unsigned short iocg_id;
+ dev_t dev;
+ /* How many times this group has been removed from active tree */
+ unsigned long dequeue;
#endif
};
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c 2009-05-13 11:40:53.000000000 -0400
+++ linux14/block/elevator-fq.c 2009-05-13 11:40:57.000000000 -0400
@@ -12,6 +12,7 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
BUG_ON(sd->active_entity == entity);
BUG_ON(sd->next_active == entity);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ struct io_group *iog = io_entity_to_iog(entity);
+ /*
+ * Keep track of how many times a group has been removed
+ * from active tree because it did not have any active
+ * backlogged ioq under it
+ */
+ if (iog)
+ iog->dequeue++;
+ }
+#endif
return ret;
}
@@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
+ struct io_cgroup *iocg;
struct io_group *iog;
struct hlist_node *n;
- u64 disk_time = 0;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
rcu_read_lock();
+ spin_lock_irq(&iocg->lock);
hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
/*
* There might be groups which are not functional and
		 * waiting to be reclaimed upon cgroup deletion.
*/
- if (rcu_dereference(iog->key))
- disk_time += iog->entity.total_service;
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_service);
+ }
}
+ spin_unlock_irq(&iocg->lock);
rcu_read_unlock();
- return disk_time;
+ cgroup_unlock();
+
+ return 0;
}
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
- struct cftype *cftype)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
struct io_cgroup *iocg;
- u64 ret;
+ struct io_group *iog;
+ struct hlist_node *n;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
- spin_lock_irq(&iocg->lock);
- ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
- spin_unlock_irq(&iocg->lock);
-
- cgroup_unlock();
-
- return ret;
-}
-
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
-{
- struct io_group *iog;
- struct hlist_node *n;
- u64 disk_sectors = 0;
rcu_read_lock();
+ spin_lock_irq(&iocg->lock);
hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
/*
* There might be groups which are not functional and
		 * waiting to be reclaimed upon cgroup deletion.
*/
- if (rcu_dereference(iog->key))
- disk_sectors += iog->entity.total_sector_service;
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_sector_service);
+ }
}
+ spin_unlock_irq(&iocg->lock);
rcu_read_unlock();
- return disk_sectors;
+ cgroup_unlock();
+
+ return 0;
}
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
- struct cftype *cftype)
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
{
- struct io_cgroup *iocg;
- u64 ret;
+ struct io_cgroup *iocg = NULL;
+ struct io_group *iog = NULL;
+ struct hlist_node *n;
if (!cgroup_lock_live_group(cgroup))
return -ENODEV;
iocg = cgroup_to_io_cgroup(cgroup);
+
+ rcu_read_lock();
spin_lock_irq(&iocg->lock);
- ret = calculate_aggr_disk_sectors(iocg);
+ /* Loop through all the io groups and print statistics */
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (rcu_dereference(iog->key)) {
+ seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev), iog->dequeue);
+ }
+ }
spin_unlock_irq(&iocg->lock);
+ rcu_read_unlock();
cgroup_unlock();
- return ret;
+ return 0;
}
/**
@@ -1222,7 +1248,7 @@ static u64 io_cgroup_disk_sectors_read(s
* to the root has already an allocated group on @bfqd.
*/
struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
- struct cgroup *cgroup)
+ struct cgroup *cgroup, struct bio *bio)
{
struct io_cgroup *iocg;
struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1250,8 +1276,13 @@ struct io_group *io_group_chain_alloc(st
io_group_init_entity(iocg, iog);
iog->my_entity = &iog->entity;
+
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
+ if (bio) {
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ iog->dev = MKDEV(disk->major, disk->first_minor);
+ }
#endif
blk_init_request_list(&iog->rl);
@@ -1364,7 +1395,7 @@ void io_group_chain_link(struct request_
*/
struct io_group *io_find_alloc_group(struct request_queue *q,
struct cgroup *cgroup, struct elv_fq_data *efqd,
- int create)
+ int create, struct bio *bio)
{
struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
struct io_group *iog = NULL;
@@ -1375,7 +1406,7 @@ struct io_group *io_find_alloc_group(str
if (iog != NULL || !create)
return iog;
- iog = io_group_chain_alloc(q, key, cgroup);
+ iog = io_group_chain_alloc(q, key, cgroup, bio);
if (iog != NULL)
io_group_chain_link(q, key, cgroup, iog, efqd);
@@ -1481,7 +1512,7 @@ struct io_group *io_get_io_group(struct
goto out;
}
- iog = io_find_alloc_group(q, cgroup, efqd, create);
+ iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
if (!iog) {
if (create)
iog = efqd->root_group;
@@ -1554,12 +1585,18 @@ struct cftype bfqio_files[] = {
},
{
.name = "disk_time",
- .read_u64 = io_cgroup_disk_time_read,
+ .read_seq_string = io_cgroup_disk_time_read,
},
{
.name = "disk_sectors",
- .read_u64 = io_cgroup_disk_sectors_read,
+ .read_seq_string = io_cgroup_disk_sectors_read,
},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ .name = "disk_dequeue",
+ .read_seq_string = io_cgroup_disk_dequeue_read,
+ },
+#endif
};
int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (2 preceding siblings ...)
2009-05-13 15:59 ` Vivek Goyal
@ 2009-05-13 17:17 ` Vivek Goyal
2009-05-13 19:09 ` Vivek Goyal
4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 17:17 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
variants above because this can be called with the request queue lock held and
we don't want to enable interrupts unconditionally here.
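A minimal sketch of what I mean, assuming the function otherwise stays as in
the patch above (untested, for illustration only):

void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
				void *key)
{
	struct io_entity *entity = &iog->entity;
	struct policy_node *pn;
	unsigned long flags;

	/*
	 * Save and restore the irq state instead of enabling interrupts
	 * unconditionally; the caller may already hold the request queue
	 * lock with interrupts disabled.
	 */
	spin_lock_irqsave(&iocg->lock, flags);
	pn = policy_search_node(iocg, key);
	if (pn) {
		entity->weight = pn->weight;
		entity->new_weight = pn->weight;
		entity->ioprio_class = pn->ioprio_class;
		entity->new_ioprio_class = pn->ioprio_class;
	} else {
		entity->weight = iocg->weight;
		entity->new_weight = iocg->weight;
		entity->ioprio_class = iocg->ioprio_class;
		entity->new_ioprio_class = iocg->ioprio_class;
	}
	spin_unlock_irqrestore(&iocg->lock, flags);

	entity->ioprio_changed = 1;
	entity->my_sched_data = &iog->sched_data;
}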
I hit following lock validator warning.
[ 81.521242] =================================
[ 81.522127] [ INFO: inconsistent lock state ]
[ 81.522127] 2.6.30-rc4-ioc #47
[ 81.522127] ---------------------------------
[ 81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 81.522127] (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[ 81.522127] {IN-SOFTIRQ-W} state was registered at:
[ 81.522127] [<ffffffffffffffff>] 0xffffffffffffffff
[ 81.522127] irq event stamp: 1006
[ 81.522127] hardirqs last enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
[ 81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
[ 81.522127] softirqs last enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
[ 81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
[ 81.522127]
[ 81.522127] other info that might help us debug this:
[ 81.522127] 3 locks held by io-group-bw-tes/4138:
[ 81.522127] #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
[ 81.522127] #1: (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[ 81.522127] #2: (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
[ 81.522127]
[ 81.522127] stack backtrace:
[ 81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
[ 81.522127] Call Trace:
[ 81.522127] [<ffffffff8105edad>] valid_state+0x17c/0x18f
[ 81.522127] [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
[ 81.522127] [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
[ 81.522127] [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
[ 81.522127] [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
[ 81.522127] [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
[ 81.522127] [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
[ 81.522127] [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
[ 81.522127] [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
[ 81.522127] [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
[ 81.522127] [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
[ 81.522127] [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
[ 81.522127] [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
[ 81.522127] [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
[ 81.522127] [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
[ 81.522127] [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
[ 81.522127] [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
[ 81.522127] [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
[ 81.522127] [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
[ 81.522127] [<ffffffff811d8019>] submit_bio+0xb1/0xbc
[ 81.522127] [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
[ 81.522127] [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
[ 81.522127] [<ffffffff81122286>] ext3_iget+0x69/0x399
[ 81.522127] [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
[ 81.522127] [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
[ 81.522127] [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
[ 81.522127] [<ffffffff810d1976>] path_walk+0x4e/0x97
[ 81.522127] [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
[ 81.522127] [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
[ 81.522127] [<ffffffff810d252a>] user_path_at+0x52/0x8c
[ 81.522127] [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
[ 81.522127] [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
[ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
[ 81.522127] [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
[ 81.522127] [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
[ 81.522127] [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
[ 81.522127] [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
[ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
[ 81.522127] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
Thanks
Vivek
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (3 preceding siblings ...)
2009-05-13 17:17 ` Vivek Goyal
@ 2009-05-13 19:09 ` Vivek Goyal
4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 19:09 UTC (permalink / raw)
To: Gui Jianfeng
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
Hi Gui,
I noticed a few things during testing.
1. Writing 0 as the weight does not remove the policy for me if I switch the
IO scheduler on the device.
- echo "/dev/sdb:500:2" > io.policy
- Change the elevator on device /dev/sdb
- echo "/dev/sdb:0:2" > io.policy
- cat io.policy
The old rule does not go away.
2. One can add the same rule twice after changing the elevator.
- echo "/dev/sdb:500:2" > io.policy
- Change the elevator on device /dev/sdb
- echo "/dev/sdb:500:2" > io.policy
- cat io.policy
The same rule appears twice.
3. If one writes to io.weight, it should not update the weight for a
device that already has a per-device rule. For example, if a cgroup has
io.weight=1000, I then set the weight on /dev/sdb to 500, and later change
io.weight to 200, the groups on /dev/sdb should not be updated. Why? Because
I think it makes more sense to keep the simple rule that as long as there is
a rule for a device, it always overrides the generic io.weight setting.
4. A malformed rule should be rejected with an invalid-value error; instead we
see an oops (a possible fix is sketched after the trace below).
- echo "/dev/sdb:0:" > io.policy
[ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0
[ 2651.588301] Oops: 0000 [#2] SMP
[ 2651.588301] last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
[ 2651.588301] CPU 2
[ 2651.588301] Modules linked in:
[ 2651.588301] Pid: 4538, comm: bash Tainted: G D 2.6.30-rc4-ioc
#52 HP xw6600 Workstation
[ 2651.588301] RIP: 0010:[<ffffffff811f035c>] [<ffffffff811f035c>]
strict_strtoul+0x24/0x79
[ 2651.588301] RSP: 0018:ffff88003dd73dc0 EFLAGS: 00010286
[ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffffffffffff
[ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
0000000000000000
[ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
ffff88003dd73cf8
[ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
ffff88003f4a1e00
[ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
ffff88003fa7ed40
[ 2651.588301] FS: 00007ff971c466f0(0000) GS:ffff88000209c000(0000)
knlGS:0000000000000000
[ 2651.588301] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
00000000000006e0
[ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
ffff880038d98000)
[ 2651.588301] Stack:
[ 2651.588301] ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
ffff88003f4a1e00
[ 2651.588301] ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
ffff880038dd2780
[ 2651.588301] ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
0000000000000000
[ 2651.588301] Call Trace:
[ 2651.588301] [<ffffffff810d8f23>] ? iput+0x2f/0x65
[ 2651.588301] [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
[ 2651.588301] [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
[ 2651.588301] [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
[ 2651.588301] [<ffffffff810c8394>] vfs_write+0xab/0x105
[ 2651.588301] [<ffffffff810c84a8>] sys_write+0x47/0x6c
[ 2651.588301] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
[ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
<f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48
[ 2651.588301] RIP [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] RSP <ffff88003dd73dc0>
[ 2651.588301] CR2: 0000000000000000
[ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
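My guess is that policy_parse_and_set() calls strict_strtoul() on s[1]/s[2]
even when fewer than three fields were supplied, so they can still be NULL.
A hedged sketch of one possible fix, keeping the rest of the function as in
your patch (untested, just to illustrate the extra validation):

static int policy_parse_and_set(char *buf, struct policy_node *newpn)
{
	char *s[3];
	char *p;
	int ret;
	int i = 0;

	memset(s, 0, sizeof(s));
	while (i < ARRAY_SIZE(s)) {
		p = strsep(&buf, ":");
		if (!p)
			break;
		if (!*p)
			continue;
		s[i++] = p;
	}

	/*
	 * Reject the rule unless all of DEV, weight and ioprio_class were
	 * supplied; otherwise s[1] or s[2] is NULL and strict_strtoul()
	 * dereferences a NULL pointer below.
	 */
	if (!s[0] || !s[1] || !s[2])
		return -EINVAL;

	newpn->key = devname_to_efqd(s[0]);
	if (!newpn->key)
		return -EINVAL;

	strcpy(newpn->dev_name, s[0]);

	ret = strict_strtoul(s[1], 10, &newpn->weight);
	if (ret || newpn->weight > WEIGHT_MAX)
		return -EINVAL;

	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
		return -EINVAL;

	return 0;
}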
Thanks
Vivek
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
> - entity->weight = entity->new_weight = iocg->weight;
> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
> entity->ioprio_changed = 1;
> entity->my_sched_data = &iog->sched_data;
> }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> atomic_set(&iog->ref, 0);
> iog->deleting = 0;
>
> - io_group_init_entity(iocg, iog);
> + io_group_init_entity(iocg, iog, key);
> iog->my_entity = &iog->entity;
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> return iog;
> }
>
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *pn;
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> +
> + if (list_empty(&iocg->list))
> + goto out;
> +
> + seq_printf(m, "dev weight class\n");
> +
> + spin_lock_irq(&iocg->lock);
> + list_for_each_entry(pn, &iocg->list, node) {
> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> + pn->weight, pn->ioprio_class);
> + }
> + spin_unlock_irq(&iocg->lock);
> +out:
> + return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> + struct policy_node *pn)
> +{
> + list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> + list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key)
> +{
> + struct policy_node *pn;
> +
> + if (list_empty(&iocg->list))
> + return NULL;
> +
> + list_for_each_entry(pn, &iocg->list, node) {
> + if (pn->key == key)
> + return pn;
> + }
> +
> + return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> + struct block_device *bdev;
> + void *key = NULL;
> + struct gendisk *disk;
> + int part;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return NULL;
> +
> + disk = get_gendisk(bdev->bd_dev, &part);
> + key = (void *)&disk->queue->elevator->efqd;
> + bdput(bdev);
> +
> + return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> + char *s[3];
> + char *p;
> + int ret;
> + int i = 0;
> +
> + memset(s, 0, sizeof(s));
> + while (i < ARRAY_SIZE(s)) {
> + p = strsep(&buf, ":");
> + if (!p)
> + break;
> + if (!*p)
> + continue;
> + s[i++] = p;
> + }
> +
> + newpn->key = devname_to_efqd(s[0]);
> + if (!newpn->key)
> + return -EINVAL;
> +
> + strcpy(newpn->dev_name, s[0]);
> +
> + ret = strict_strtoul(s[1], 10, &newpn->weight);
> + if (ret || newpn->weight > WEIGHT_MAX)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *newpn, *pn;
> + char *buf;
> + int ret = 0;
> + int keep_newpn = 0;
> + struct hlist_node *n;
> + struct io_group *iog;
> +
> + buf = kstrdup(buffer, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> + if (!newpn) {
> + ret = -ENOMEM;
> + goto free_buf;
> + }
> +
> + ret = policy_parse_and_set(buf, newpn);
> + if (ret)
> + goto free_newpn;
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto free_newpn;
> + }
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> + spin_lock_irq(&iocg->lock);
> +
> + pn = policy_search_node(iocg, newpn->key);
> + if (!pn) {
> + if (newpn->weight != 0) {
> + policy_insert_node(iocg, newpn);
> + keep_newpn = 1;
> + }
> + goto update_io_group;
> + }
> +
> + if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> + policy_delete_node(pn);
> + goto update_io_group;
> + }
> +
> + pn->weight = newpn->weight;
> + pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> + if (iog->key == newpn->key) {
> + if (newpn->weight) {
> + iog->entity.new_weight = newpn->weight;
> + iog->entity.new_ioprio_class =
> + newpn->ioprio_class;
> + /*
> + * iog weight and ioprio_class updating
> + * actually happens if ioprio_changed is set.
> + * So ensure ioprio_changed is not set until
> + * new weight and new ioprio_class are updated.
> + */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + } else {
> + iog->entity.new_weight = iocg->weight;
> + iog->entity.new_ioprio_class =
> + iocg->ioprio_class;
> +
> + /* The same as above */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + }
> + }
> + }
> + spin_unlock_irq(&iocg->lock);
> +
> + cgroup_unlock();
> +
> +free_newpn:
> + if (!keep_newpn)
> + kfree(newpn);
> +free_buf:
> + kfree(buf);
> + return ret;
> +}
> +
> struct cftype bfqio_files[] = {
> {
> + .name = "policy",
> + .read_seq_string = io_cgroup_policy_read,
> + .write_string = io_cgroup_policy_write,
> + .max_write_len = 256,
> + },
> + {
> .name = "weight",
> .read_u64 = io_cgroup_weight_read,
> .write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
> INIT_HLIST_HEAD(&iocg->group_data);
> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> + INIT_LIST_HEAD(&iocg->list);
>
> return &iocg->css;
> }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> unsigned long flags, flags1;
> int queue_lock_held = 0;
> struct elv_fq_data *efqd;
> + struct policy_node *pn, *pntmp;
>
> /*
> * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
> BUG_ON(!hlist_empty(&iocg->group_data));
>
> free_css_id(&io_subsys, &iocg->css);
> +
> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> + policy_delete_node(pn);
> + kfree(pn);
> + }
> +
> kfree(iocg);
> }
>
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
> spinlock_t lock;
> struct hlist_head group_data;
> };
> --
> 1.5.4.rc3
>
>
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
` (3 preceding siblings ...)
[not found] ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 17:17 ` Vivek Goyal
[not found] ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 1:24 ` Gui Jianfeng
2009-05-13 19:09 ` Vivek Goyal
5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 17:17 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
variants above because this can be called with the request queue lock held and
we don't want to enable interrupts unconditionally here.
I hit following lock validator warning.
[ 81.521242] =================================
[ 81.522127] [ INFO: inconsistent lock state ]
[ 81.522127] 2.6.30-rc4-ioc #47
[ 81.522127] ---------------------------------
[ 81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 81.522127] (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[ 81.522127] {IN-SOFTIRQ-W} state was registered at:
[ 81.522127] [<ffffffffffffffff>] 0xffffffffffffffff
[ 81.522127] irq event stamp: 1006
[ 81.522127] hardirqs last enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
[ 81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
[ 81.522127] softirqs last enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
[ 81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
[ 81.522127]
[ 81.522127] other info that might help us debug this:
[ 81.522127] 3 locks held by io-group-bw-tes/4138:
[ 81.522127] #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
[ 81.522127] #1: (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[ 81.522127] #2: (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
[ 81.522127]
[ 81.522127] stack backtrace:
[ 81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
[ 81.522127] Call Trace:
[ 81.522127] [<ffffffff8105edad>] valid_state+0x17c/0x18f
[ 81.522127] [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
[ 81.522127] [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
[ 81.522127] [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
[ 81.522127] [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
[ 81.522127] [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
[ 81.522127] [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
[ 81.522127] [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
[ 81.522127] [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
[ 81.522127] [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
[ 81.522127] [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
[ 81.522127] [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
[ 81.522127] [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
[ 81.522127] [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
[ 81.522127] [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
[ 81.522127] [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
[ 81.522127] [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
[ 81.522127] [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
[ 81.522127] [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
[ 81.522127] [<ffffffff811d8019>] submit_bio+0xb1/0xbc
[ 81.522127] [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
[ 81.522127] [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
[ 81.522127] [<ffffffff81122286>] ext3_iget+0x69/0x399
[ 81.522127] [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
[ 81.522127] [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
[ 81.522127] [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
[ 81.522127] [<ffffffff810d1976>] path_walk+0x4e/0x97
[ 81.522127] [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
[ 81.522127] [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
[ 81.522127] [<ffffffff810d252a>] user_path_at+0x52/0x8c
[ 81.522127] [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
[ 81.522127] [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
[ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
[ 81.522127] [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
[ 81.522127] [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
[ 81.522127] [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
[ 81.522127] [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
[ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[ 81.522127] [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
[ 81.522127] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
Thanks
Vivek
[parent not found: <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 1:24 ` Gui Jianfeng
0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
...
>> + }
>> + spin_unlock_irq(&iocg->lock);
>>
>
> I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
> variants above because this can be called with the request queue lock held
> and we don't want to enable interrupts unconditionally here.
Will change.
>
> I hit following lock validator warning.
>
>
> [ 81.521242] =================================
> [ 81.522127] [ INFO: inconsistent lock state ]
> [ 81.522127] 2.6.30-rc4-ioc #47
> [ 81.522127] ---------------------------------
> [ 81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [ 81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 81.522127] (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [ 81.522127] {IN-SOFTIRQ-W} state was registered at:
> [ 81.522127] [<ffffffffffffffff>] 0xffffffffffffffff
> [ 81.522127] irq event stamp: 1006
> [ 81.522127] hardirqs last enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
> [ 81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
> [ 81.522127] softirqs last enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
> [ 81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
> [ 81.522127]
> [ 81.522127] other info that might help us debug this:
> [ 81.522127] 3 locks held by io-group-bw-tes/4138:
> [ 81.522127] #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
> [ 81.522127] #1: (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [ 81.522127] #2: (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
> [ 81.522127]
> [ 81.522127] stack backtrace:
> [ 81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
> [ 81.522127] Call Trace:
> [ 81.522127] [<ffffffff8105edad>] valid_state+0x17c/0x18f
> [ 81.522127] [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
> [ 81.522127] [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
> [ 81.522127] [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
> [ 81.522127] [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
> [ 81.522127] [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
> [ 81.522127] [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
> [ 81.522127] [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
> [ 81.522127] [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
> [ 81.522127] [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
> [ 81.522127] [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
> [ 81.522127] [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
> [ 81.522127] [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
> [ 81.522127] [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
> [ 81.522127] [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
> [ 81.522127] [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
> [ 81.522127] [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
> [ 81.522127] [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
> [ 81.522127] [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
> [ 81.522127] [<ffffffff811d8019>] submit_bio+0xb1/0xbc
> [ 81.522127] [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
> [ 81.522127] [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
> [ 81.522127] [<ffffffff81122286>] ext3_iget+0x69/0x399
> [ 81.522127] [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
> [ 81.522127] [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
> [ 81.522127] [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
> [ 81.522127] [<ffffffff810d1976>] path_walk+0x4e/0x97
> [ 81.522127] [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
> [ 81.522127] [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
> [ 81.522127] [<ffffffff810d252a>] user_path_at+0x52/0x8c
> [ 81.522127] [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
> [ 81.522127] [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
> [ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
> [ 81.522127] [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
> [ 81.522127] [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
> [ 81.522127] [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
> [ 81.522127] [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
> [ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
> [ 81.522127] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
>
> Thanks
> Vivek
>
>
>
>
--
Regards
Gui Jianfeng
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 17:17 ` Vivek Goyal
[not found] ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 1:24 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
...
>> + }
>> + spin_unlock_irq(&iocg->lock);
>>
>
> I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
> variants above because this can be called with the request queue lock held
> and we don't want to enable interrupts unconditionally here.
Will change.
>
> I hit following lock validator warning.
>
>
> [ 81.521242] =================================
> [ 81.522127] [ INFO: inconsistent lock state ]
> [ 81.522127] 2.6.30-rc4-ioc #47
> [ 81.522127] ---------------------------------
> [ 81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [ 81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 81.522127] (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [ 81.522127] {IN-SOFTIRQ-W} state was registered at:
> [ 81.522127] [<ffffffffffffffff>] 0xffffffffffffffff
> [ 81.522127] irq event stamp: 1006
> [ 81.522127] hardirqs last enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
> [ 81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
> [ 81.522127] softirqs last enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
> [ 81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
> [ 81.522127]
> [ 81.522127] other info that might help us debug this:
> [ 81.522127] 3 locks held by io-group-bw-tes/4138:
> [ 81.522127] #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
> [ 81.522127] #1: (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [ 81.522127] #2: (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
> [ 81.522127]
> [ 81.522127] stack backtrace:
> [ 81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
> [ 81.522127] Call Trace:
> [ 81.522127] [<ffffffff8105edad>] valid_state+0x17c/0x18f
> [ 81.522127] [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
> [ 81.522127] [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
> [ 81.522127] [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
> [ 81.522127] [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
> [ 81.522127] [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
> [ 81.522127] [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
> [ 81.522127] [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
> [ 81.522127] [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
> [ 81.522127] [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
> [ 81.522127] [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
> [ 81.522127] [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
> [ 81.522127] [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
> [ 81.522127] [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
> [ 81.522127] [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
> [ 81.522127] [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
> [ 81.522127] [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
> [ 81.522127] [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
> [ 81.522127] [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
> [ 81.522127] [<ffffffff811d8019>] submit_bio+0xb1/0xbc
> [ 81.522127] [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
> [ 81.522127] [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
> [ 81.522127] [<ffffffff81122286>] ext3_iget+0x69/0x399
> [ 81.522127] [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
> [ 81.522127] [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
> [ 81.522127] [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
> [ 81.522127] [<ffffffff810d1976>] path_walk+0x4e/0x97
> [ 81.522127] [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
> [ 81.522127] [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
> [ 81.522127] [<ffffffff810d252a>] user_path_at+0x52/0x8c
> [ 81.522127] [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
> [ 81.522127] [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
> [ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
> [ 81.522127] [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
> [ 81.522127] [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
> [ 81.522127] [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
> [ 81.522127] [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
> [ 81.522127] [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [ 81.522127] [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
> [ 81.522127] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
>
> Thanks
> Vivek
>
>
>
>
--
Regards
Gui Jianfeng
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
` (4 preceding siblings ...)
2009-05-13 17:17 ` Vivek Goyal
@ 2009-05-13 19:09 ` Vivek Goyal
2009-05-14 1:35 ` Gui Jianfeng
` (2 more replies)
5 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 19:09 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
Hi Gui,
Noticed a few things during testing (a small reproduction sketch follows the oops trace below).
1. Writing 0 as the weight does not remove the policy for me if I switch the
IO scheduler on the device.
- echo "/dev/sdb:500:2" > io.policy
- Change elevator on device /dev/sdb
- echo "/dev/sdb:0:2" > io.policy
- cat io.policy
The old rule does not go away.
2. One can add the same rule twice after changing the elevator.
- echo "/dev/sdb:500:2" > io.policy
- Change elevator on device /dev/sdb
- echo "/dev/sdb:500:2" > io.policy
- cat io.policy
The same rule appears twice.
3. Writing to io.weight should not update the weight for a device that
already has a rule. For example, if a cgroup has io.weight=1000, I then set
the weight on /dev/sdb to 500, and later change io.weight to 200, the groups
on /dev/sdb should not be updated. Why? Because I think it makes more sense
to keep the simple rule that as long as a device has a rule, that rule
always overrides the generic io.weight setting.
4. A wrong rule should return an invalid-value error; instead we see an oops.
- echo "/dev/sdb:0:" > io.policy
[ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0
[ 2651.588301] Oops: 0000 [#2] SMP
[ 2651.588301] last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
[ 2651.588301] CPU 2
[ 2651.588301] Modules linked in:
[ 2651.588301] Pid: 4538, comm: bash Tainted: G D 2.6.30-rc4-ioc
#52 HP xw6600 Workstation
[ 2651.588301] RIP: 0010:[<ffffffff811f035c>] [<ffffffff811f035c>]
strict_strtoul+0x24/0x79
[ 2651.588301] RSP: 0018:ffff88003dd73dc0 EFLAGS: 00010286
[ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffffffffffff
[ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
0000000000000000
[ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
ffff88003dd73cf8
[ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
ffff88003f4a1e00
[ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
ffff88003fa7ed40
[ 2651.588301] FS: 00007ff971c466f0(0000) GS:ffff88000209c000(0000)
knlGS:0000000000000000
[ 2651.588301] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
00000000000006e0
[ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
ffff880038d98000)
[ 2651.588301] Stack:
[ 2651.588301] ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
ffff88003f4a1e00
[ 2651.588301] ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
ffff880038dd2780
[ 2651.588301] ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
0000000000000000
[ 2651.588301] Call Trace:
[ 2651.588301] [<ffffffff810d8f23>] ? iput+0x2f/0x65
[ 2651.588301] [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
[ 2651.588301] [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
[ 2651.588301] [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
[ 2651.588301] [<ffffffff810c8394>] vfs_write+0xab/0x105
[ 2651.588301] [<ffffffff810c84a8>] sys_write+0x47/0x6c
[ 2651.588301] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
[ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
<f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48
[ 2651.588301] RIP [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] RSP <ffff88003dd73dc0>
[ 2651.588301] CR2: 0000000000000000
[ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
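To make the expected behaviour concrete, here is a small reproduction sketch
covering the points above (assuming the io controller is mounted at /cgroup
with a test group grp1; the mount point, group name and scheduler name are
just examples, not output from this patch):

  cd /cgroup/grp1
  echo "/dev/sdb:500:2" > io.policy
  echo deadline > /sys/block/sdb/queue/scheduler   # switch the IO scheduler
  echo "/dev/sdb:0:2" > io.policy
  cat io.policy                    # point 1: expect no /dev/sdb rule left
  echo "/dev/sdb:500:2" > io.policy
  cat io.policy                    # point 2: expect the rule listed only once
  echo 200 > io.weight             # point 3: groups on /dev/sdb should keep 500
  echo "/dev/sdb:0:" > io.policy   # point 4: expect -EINVAL, not an oops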
Thanks
Vivek
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
> block/elevator-fq.h | 11 +++
> 2 files changed, 245 insertions(+), 5 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + void *key)
> {
> struct io_entity *entity = &iog->entity;
> + struct policy_node *pn;
> +
> + spin_lock_irq(&iocg->lock);
> + pn = policy_search_node(iocg, key);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irq(&iocg->lock);
>
> - entity->weight = entity->new_weight = iocg->weight;
> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
> entity->ioprio_changed = 1;
> entity->my_sched_data = &iog->sched_data;
> }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> atomic_set(&iog->ref, 0);
> iog->deleting = 0;
>
> - io_group_init_entity(iocg, iog);
> + io_group_init_entity(iocg, iog, key);
> iog->my_entity = &iog->entity;
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> return iog;
> }
>
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *pn;
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> +
> + if (list_empty(&iocg->list))
> + goto out;
> +
> + seq_printf(m, "dev weight class\n");
> +
> + spin_lock_irq(&iocg->lock);
> + list_for_each_entry(pn, &iocg->list, node) {
> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> + pn->weight, pn->ioprio_class);
> + }
> + spin_unlock_irq(&iocg->lock);
> +out:
> + return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> + struct policy_node *pn)
> +{
> + list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> + list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> + void *key)
> +{
> + struct policy_node *pn;
> +
> + if (list_empty(&iocg->list))
> + return NULL;
> +
> + list_for_each_entry(pn, &iocg->list, node) {
> + if (pn->key == key)
> + return pn;
> + }
> +
> + return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> + struct block_device *bdev;
> + void *key = NULL;
> + struct gendisk *disk;
> + int part;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return NULL;
> +
> + disk = get_gendisk(bdev->bd_dev, &part);
> + key = (void *)&disk->queue->elevator->efqd;
> + bdput(bdev);
> +
> + return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> + char *s[3];
> + char *p;
> + int ret;
> + int i = 0;
> +
> + memset(s, 0, sizeof(s));
> + while (i < ARRAY_SIZE(s)) {
> + p = strsep(&buf, ":");
> + if (!p)
> + break;
> + if (!*p)
> + continue;
> + s[i++] = p;
> + }
> +
> + newpn->key = devname_to_efqd(s[0]);
> + if (!newpn->key)
> + return -EINVAL;
> +
> + strcpy(newpn->dev_name, s[0]);
> +
> + ret = strict_strtoul(s[1], 10, &newpn->weight);
> + if (ret || newpn->weight > WEIGHT_MAX)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct io_cgroup *iocg;
> + struct policy_node *newpn, *pn;
> + char *buf;
> + int ret = 0;
> + int keep_newpn = 0;
> + struct hlist_node *n;
> + struct io_group *iog;
> +
> + buf = kstrdup(buffer, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> + if (!newpn) {
> + ret = -ENOMEM;
> + goto free_buf;
> + }
> +
> + ret = policy_parse_and_set(buf, newpn);
> + if (ret)
> + goto free_newpn;
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto free_newpn;
> + }
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> + spin_lock_irq(&iocg->lock);
> +
> + pn = policy_search_node(iocg, newpn->key);
> + if (!pn) {
> + if (newpn->weight != 0) {
> + policy_insert_node(iocg, newpn);
> + keep_newpn = 1;
> + }
> + goto update_io_group;
> + }
> +
> + if (newpn->weight == 0) {
> + /* weight == 0 means deleteing a policy */
> + policy_delete_node(pn);
> + goto update_io_group;
> + }
> +
> + pn->weight = newpn->weight;
> + pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> + if (iog->key == newpn->key) {
> + if (newpn->weight) {
> + iog->entity.new_weight = newpn->weight;
> + iog->entity.new_ioprio_class =
> + newpn->ioprio_class;
> + /*
> + * iog weight and ioprio_class updating
> + * actually happens if ioprio_changed is set.
> + * So ensure ioprio_changed is not set until
> + * new weight and new ioprio_class are updated.
> + */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + } else {
> + iog->entity.new_weight = iocg->weight;
> + iog->entity.new_ioprio_class =
> + iocg->ioprio_class;
> +
> + /* The same as above */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + }
> + }
> + }
> + spin_unlock_irq(&iocg->lock);
> +
> + cgroup_unlock();
> +
> +free_newpn:
> + if (!keep_newpn)
> + kfree(newpn);
> +free_buf:
> + kfree(buf);
> + return ret;
> +}
> +
> struct cftype bfqio_files[] = {
> {
> + .name = "policy",
> + .read_seq_string = io_cgroup_policy_read,
> + .write_string = io_cgroup_policy_write,
> + .max_write_len = 256,
> + },
> + {
> .name = "weight",
> .read_u64 = io_cgroup_weight_read,
> .write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
> INIT_HLIST_HEAD(&iocg->group_data);
> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> + INIT_LIST_HEAD(&iocg->list);
>
> return &iocg->css;
> }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> unsigned long flags, flags1;
> int queue_lock_held = 0;
> struct elv_fq_data *efqd;
> + struct policy_node *pn, *pntmp;
>
> /*
> * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
> BUG_ON(!hlist_empty(&iocg->group_data));
>
> free_css_id(&io_subsys, &iocg->css);
> +
> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> + policy_delete_node(pn);
> + kfree(pn);
> + }
> +
> kfree(iocg);
> }
>
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
> {
> entity->ioprio = entity->new_ioprio;
> - entity->weight = entity->new_weight;
> + entity->weight = entity->new_weigh;
> entity->ioprio_class = entity->new_ioprio_class;
> entity->sched_data = &iog->sched_data;
> }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
> #endif
> };
>
> +struct policy_node {
> + struct list_head node;
> + char dev_name[32];
> + void *key;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of policy_node */
> + struct list_head list;
> +
> spinlock_t lock;
> struct hlist_head group_data;
> };
> --
> 1.5.4.rc3
>
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 19:09 ` Vivek Goyal
@ 2009-05-14 1:35 ` Gui Jianfeng
[not found] ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 7:26 ` Gui Jianfeng
2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:35 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
>> # echo /dev/hda:500:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hda 500 1
>> /dev/hdb 300 2
>>
>> Remove the policy for /dev/hda in this cgroup
>> # echo /dev/hda:0:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>
> Hi Gui,
>
> Noticed a few things during testing.
>
> 1. Writing 0 as the weight does not remove the policy for me if I switch the
> IO scheduler on the device.
>
> - echo "/dev/sdb:500:2" > io.policy
> - Change elevator on device /dev/sdb
> - echo "/dev/sdb:0:2" > io.policy
> - cat io.policy
> The old rule does not go away.
>
> 2. One can add the same rule twice after changing the elevator.
>
> - echo "/dev/sdb:500:2" > io.policy
> - Change elevator on device /dev/sdb
> - echo "/dev/sdb:500:2" > io.policy
> - cat io.policy
>
> The same rule appears twice.
>
> 3. Writing to io.weight should not update the weight for a device that
> already has a rule. For example, if a cgroup has io.weight=1000, I then set
> the weight on /dev/sdb to 500, and later change io.weight to 200, the groups
> on /dev/sdb should not be updated. Why? Because I think it makes more sense
> to keep the simple rule that as long as a device has a rule, that rule
> always overrides the generic io.weight setting.
>
> 4. A wrong rule should return an invalid-value error; instead we see an oops.
>
> - echo "/dev/sdb:0:" > io.policy
Hi Vivek,
Thanks for testing, I'll fix the above problems and send an updated version.
>
> [ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0
> [ 2651.588301] Oops: 0000 [#2] SMP
> [ 2651.588301] last sysfs file:
> /sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
> [ 2651.588301] CPU 2
> [ 2651.588301] Modules linked in:
> [ 2651.588301] Pid: 4538, comm: bash Tainted: G D 2.6.30-rc4-ioc
> #52 HP xw6600 Workstation
> [ 2651.588301] RIP: 0010:[<ffffffff811f035c>] [<ffffffff811f035c>]
> strict_strtoul+0x24/0x79
> [ 2651.588301] RSP: 0018:ffff88003dd73dc0 EFLAGS: 00010286
> [ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> ffffffffffffffff
> [ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
> 0000000000000000
> [ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
> ffff88003dd73cf8
> [ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
> ffff88003f4a1e00
> [ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
> ffff88003fa7ed40
> [ 2651.588301] FS: 00007ff971c466f0(0000) GS:ffff88000209c000(0000)
> knlGS:0000000000000000
> [ 2651.588301] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
> 00000000000006e0
> [ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
> ffff880038d98000)
> [ 2651.588301] Stack:
> [ 2651.588301] ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
> ffff88003f4a1e00
> [ 2651.588301] ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
> ffff880038dd2780
> [ 2651.588301] ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
> 0000000000000000
> [ 2651.588301] Call Trace:
> [ 2651.588301] [<ffffffff810d8f23>] ? iput+0x2f/0x65
> [ 2651.588301] [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
> [ 2651.588301] [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
> [ 2651.588301] [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
> [ 2651.588301] [<ffffffff810c8394>] vfs_write+0xab/0x105
> [ 2651.588301] [<ffffffff810c84a8>] sys_write+0x47/0x6c
> [ 2651.588301] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> [ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
> 41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
> <f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48
> [ 2651.588301] RIP [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] RSP <ffff88003dd73dc0>
> [ 2651.588301] CR2: 0000000000000000
> [ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>> block/elevator-fq.h | 11 +++
>> 2 files changed, 245 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 69435ab..7c95d55 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -12,6 +12,9 @@
>> #include "elevator-fq.h"
>> #include <linux/blktrace_api.h>
>> #include <linux/biotrack.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/genhd.h>
>> +
>>
>> /* Values taken from cfq */
>> const int elv_slice_sync = HZ / 10;
>> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>> }
>> EXPORT_SYMBOL(io_lookup_io_group_current);
>>
>> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> + void *key);
>> +
>> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
>> + void *key)
>> {
>> struct io_entity *entity = &iog->entity;
>> + struct policy_node *pn;
>> +
>> + spin_lock_irq(&iocg->lock);
>> + pn = policy_search_node(iocg, key);
>> + if (pn) {
>> + entity->weight = pn->weight;
>> + entity->new_weight = pn->weight;
>> + entity->ioprio_class = pn->ioprio_class;
>> + entity->new_ioprio_class = pn->ioprio_class;
>> + } else {
>> + entity->weight = iocg->weight;
>> + entity->new_weight = iocg->weight;
>> + entity->ioprio_class = iocg->ioprio_class;
>> + entity->new_ioprio_class = iocg->ioprio_class;
>> + }
>> + spin_unlock_irq(&iocg->lock);
>>
>> - entity->weight = entity->new_weight = iocg->weight;
>> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>> entity->ioprio_changed = 1;
>> entity->my_sched_data = &iog->sched_data;
>> }
>> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>> atomic_set(&iog->ref, 0);
>> iog->deleting = 0;
>>
>> - io_group_init_entity(iocg, iog);
>> + io_group_init_entity(iocg, iog, key);
>> iog->my_entity = &iog->entity;
>> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>> iog->iocg_id = css_id(&iocg->css);
>> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>> return iog;
>> }
>>
>> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
>> + struct seq_file *m)
>> +{
>> + struct io_cgroup *iocg;
>> + struct policy_node *pn;
>> +
>> + iocg = cgroup_to_io_cgroup(cgrp);
>> +
>> + if (list_empty(&iocg->list))
>> + goto out;
>> +
>> + seq_printf(m, "dev weight class\n");
>> +
>> + spin_lock_irq(&iocg->lock);
>> + list_for_each_entry(pn, &iocg->list, node) {
>> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
>> + pn->weight, pn->ioprio_class);
>> + }
>> + spin_unlock_irq(&iocg->lock);
>> +out:
>> + return 0;
>> +}
>> +
>> +static inline void policy_insert_node(struct io_cgroup *iocg,
>> + struct policy_node *pn)
>> +{
>> + list_add(&pn->node, &iocg->list);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static inline void policy_delete_node(struct policy_node *pn)
>> +{
>> + list_del(&pn->node);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> + void *key)
>> +{
>> + struct policy_node *pn;
>> +
>> + if (list_empty(&iocg->list))
>> + return NULL;
>> +
>> + list_for_each_entry(pn, &iocg->list, node) {
>> + if (pn->key == key)
>> + return pn;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static void *devname_to_efqd(const char *buf)
>> +{
>> + struct block_device *bdev;
>> + void *key = NULL;
>> + struct gendisk *disk;
>> + int part;
>> +
>> + bdev = lookup_bdev(buf);
>> + if (IS_ERR(bdev))
>> + return NULL;
>> +
>> + disk = get_gendisk(bdev->bd_dev, &part);
>> + key = (void *)&disk->queue->elevator->efqd;
>> + bdput(bdev);
>> +
>> + return key;
>> +}
>> +
>> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
>> +{
>> + char *s[3];
>> + char *p;
>> + int ret;
>> + int i = 0;
>> +
>> + memset(s, 0, sizeof(s));
>> + while (i < ARRAY_SIZE(s)) {
>> + p = strsep(&buf, ":");
>> + if (!p)
>> + break;
>> + if (!*p)
>> + continue;
>> + s[i++] = p;
>> + }
>> +
>> + newpn->key = devname_to_efqd(s[0]);
>> + if (!newpn->key)
>> + return -EINVAL;
>> +
>> + strcpy(newpn->dev_name, s[0]);
>> +
>> + ret = strict_strtoul(s[1], 10, &newpn->weight);
>> + if (ret || newpn->weight > WEIGHT_MAX)
>> + return -EINVAL;
>> +
>> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
>> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
>> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
>> + const char *buffer)
>> +{
>> + struct io_cgroup *iocg;
>> + struct policy_node *newpn, *pn;
>> + char *buf;
>> + int ret = 0;
>> + int keep_newpn = 0;
>> + struct hlist_node *n;
>> + struct io_group *iog;
>> +
>> + buf = kstrdup(buffer, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
>> + if (!newpn) {
>> + ret = -ENOMEM;
>> + goto free_buf;
>> + }
>> +
>> + ret = policy_parse_and_set(buf, newpn);
>> + if (ret)
>> + goto free_newpn;
>> +
>> + if (!cgroup_lock_live_group(cgrp)) {
>> + ret = -ENODEV;
>> + goto free_newpn;
>> + }
>> +
>> + iocg = cgroup_to_io_cgroup(cgrp);
>> + spin_lock_irq(&iocg->lock);
>> +
>> + pn = policy_search_node(iocg, newpn->key);
>> + if (!pn) {
>> + if (newpn->weight != 0) {
>> + policy_insert_node(iocg, newpn);
>> + keep_newpn = 1;
>> + }
>> + goto update_io_group;
>> + }
>> +
>> + if (newpn->weight == 0) {
>> + /* weight == 0 means deleteing a policy */
>> + policy_delete_node(pn);
>> + goto update_io_group;
>> + }
>> +
>> + pn->weight = newpn->weight;
>> + pn->ioprio_class = newpn->ioprio_class;
>> +
>> +update_io_group:
>> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
>> + if (iog->key == newpn->key) {
>> + if (newpn->weight) {
>> + iog->entity.new_weight = newpn->weight;
>> + iog->entity.new_ioprio_class =
>> + newpn->ioprio_class;
>> + /*
>> + * iog weight and ioprio_class updating
>> + * actually happens if ioprio_changed is set.
>> + * So ensure ioprio_changed is not set until
>> + * new weight and new ioprio_class are updated.
>> + */
>> + smp_wmb();
>> + iog->entity.ioprio_changed = 1;
>> + } else {
>> + iog->entity.new_weight = iocg->weight;
>> + iog->entity.new_ioprio_class =
>> + iocg->ioprio_class;
>> +
>> + /* The same as above */
>> + smp_wmb();
>> + iog->entity.ioprio_changed = 1;
>> + }
>> + }
>> + }
>> + spin_unlock_irq(&iocg->lock);
>> +
>> + cgroup_unlock();
>> +
>> +free_newpn:
>> + if (!keep_newpn)
>> + kfree(newpn);
>> +free_buf:
>> + kfree(buf);
>> + return ret;
>> +}
>> +
>> struct cftype bfqio_files[] = {
>> {
>> + .name = "policy",
>> + .read_seq_string = io_cgroup_policy_read,
>> + .write_string = io_cgroup_policy_write,
>> + .max_write_len = 256,
>> + },
>> + {
>> .name = "weight",
>> .read_u64 = io_cgroup_weight_read,
>> .write_u64 = io_cgroup_weight_write,
>> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>> INIT_HLIST_HEAD(&iocg->group_data);
>> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>> + INIT_LIST_HEAD(&iocg->list);
>>
>> return &iocg->css;
>> }
>> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> unsigned long flags, flags1;
>> int queue_lock_held = 0;
>> struct elv_fq_data *efqd;
>> + struct policy_node *pn, *pntmp;
>>
>> /*
>> * io groups are linked in two lists. One list is maintained
>> @@ -1823,6 +2046,12 @@ locked:
>> BUG_ON(!hlist_empty(&iocg->group_data));
>>
>> free_css_id(&io_subsys, &iocg->css);
>> +
>> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
>> + policy_delete_node(pn);
>> + kfree(pn);
>> + }
>> +
>> kfree(iocg);
>> }
>>
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>> {
>> entity->ioprio = entity->new_ioprio;
>> - entity->weight = entity->new_weight;
>> + entity->weight = entity->new_weigh;
>> entity->ioprio_class = entity->new_ioprio_class;
>> entity->sched_data = &iog->sched_data;
>> }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>> #endif
>> };
>>
>> +struct policy_node {
>> + struct list_head node;
>> + char dev_name[32];
>> + void *key;
>> + unsigned long weight;
>> + unsigned long ioprio_class;
>> +};
>> +
>> /**
>> * struct bfqio_cgroup - bfq cgroup data structure.
>> * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>
>> unsigned long weight, ioprio_class;
>>
>> + /* list of policy_node */
>> + struct list_head list;
>> +
>> spinlock_t lock;
>> struct hlist_head group_data;
>> };
>> --
>> 1.5.4.rc3
>>
>>
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
[parent not found: <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 1:35 ` Gui Jianfeng
2009-05-14 7:26 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 1:35 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
>> # echo /dev/hda:500:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hda 500 1
>> /dev/hdb 300 2
>>
>> Remove the policy for /dev/hda in this cgroup
>> # echo /dev/hda:0:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>
> Hi Gui,
>
> Noticed a few things during testing.
>
> 1. Writing 0 as the weight does not remove the policy for me if I switch the
> IO scheduler on the device.
>
> - echo "/dev/sdb:500:2" > io.policy
> - Change elevator on device /dev/sdb
> - echo "/dev/sdb:0:2" > io.policy
> - cat io.policy
> The old rule does not go away.
>
> 2. One can add the same rule twice after changing the elevator.
>
> - echo "/dev/sdb:500:2" > io.policy
> - Change elevator on device /dev/sdb
> - echo "/dev/sdb:500:2" > io.policy
> - cat io.policy
>
> The same rule appears twice.
>
> 3. Writing to io.weight should not update the weight for a device that
> already has a rule. For example, if a cgroup has io.weight=1000, I then set
> the weight on /dev/sdb to 500, and later change io.weight to 200, the groups
> on /dev/sdb should not be updated. Why? Because I think it makes more sense
> to keep the simple rule that as long as a device has a rule, that rule
> always overrides the generic io.weight setting.
>
> 4. A wrong rule should return an invalid-value error; instead we see an oops.
>
> - echo "/dev/sdb:0:" > io.policy
Hi Vivek,
Thanks for testing, I'll fix the above problems and send an updated version.
>
> [ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0
> [ 2651.588301] Oops: 0000 [#2] SMP
> [ 2651.588301] last sysfs file:
> /sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
> [ 2651.588301] CPU 2
> [ 2651.588301] Modules linked in:
> [ 2651.588301] Pid: 4538, comm: bash Tainted: G D 2.6.30-rc4-ioc
> #52 HP xw6600 Workstation
> [ 2651.588301] RIP: 0010:[<ffffffff811f035c>] [<ffffffff811f035c>]
> strict_strtoul+0x24/0x79
> [ 2651.588301] RSP: 0018:ffff88003dd73dc0 EFLAGS: 00010286
> [ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> ffffffffffffffff
> [ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
> 0000000000000000
> [ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
> ffff88003dd73cf8
> [ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
> ffff88003f4a1e00
> [ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
> ffff88003fa7ed40
> [ 2651.588301] FS: 00007ff971c466f0(0000) GS:ffff88000209c000(0000)
> knlGS:0000000000000000
> [ 2651.588301] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
> 00000000000006e0
> [ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
> ffff880038d98000)
> [ 2651.588301] Stack:
> [ 2651.588301] ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
> ffff88003f4a1e00
> [ 2651.588301] ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
> ffff880038dd2780
> [ 2651.588301] ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
> 0000000000000000
> [ 2651.588301] Call Trace:
> [ 2651.588301] [<ffffffff810d8f23>] ? iput+0x2f/0x65
> [ 2651.588301] [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
> [ 2651.588301] [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
> [ 2651.588301] [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
> [ 2651.588301] [<ffffffff810c8394>] vfs_write+0xab/0x105
> [ 2651.588301] [<ffffffff810c84a8>] sys_write+0x47/0x6c
> [ 2651.588301] [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> [ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
> 41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
> <f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48
> [ 2651.588301] RIP [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] RSP <ffff88003dd73dc0>
> [ 2651.588301] CR2: 0000000000000000
> [ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>> block/elevator-fq.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>> block/elevator-fq.h | 11 +++
>> 2 files changed, 245 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 69435ab..7c95d55 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -12,6 +12,9 @@
>> #include "elevator-fq.h"
>> #include <linux/blktrace_api.h>
>> #include <linux/biotrack.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/genhd.h>
>> +
>>
>> /* Values taken from cfq */
>> const int elv_slice_sync = HZ / 10;
>> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>> }
>> EXPORT_SYMBOL(io_lookup_io_group_current);
>>
>> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> + void *key);
>> +
>> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
>> + void *key)
>> {
>> struct io_entity *entity = &iog->entity;
>> + struct policy_node *pn;
>> +
>> + spin_lock_irq(&iocg->lock);
>> + pn = policy_search_node(iocg, key);
>> + if (pn) {
>> + entity->weight = pn->weight;
>> + entity->new_weight = pn->weight;
>> + entity->ioprio_class = pn->ioprio_class;
>> + entity->new_ioprio_class = pn->ioprio_class;
>> + } else {
>> + entity->weight = iocg->weight;
>> + entity->new_weight = iocg->weight;
>> + entity->ioprio_class = iocg->ioprio_class;
>> + entity->new_ioprio_class = iocg->ioprio_class;
>> + }
>> + spin_unlock_irq(&iocg->lock);
>>
>> - entity->weight = entity->new_weight = iocg->weight;
>> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>> entity->ioprio_changed = 1;
>> entity->my_sched_data = &iog->sched_data;
>> }
>> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>> atomic_set(&iog->ref, 0);
>> iog->deleting = 0;
>>
>> - io_group_init_entity(iocg, iog);
>> + io_group_init_entity(iocg, iog, key);
>> iog->my_entity = &iog->entity;
>> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>> iog->iocg_id = css_id(&iocg->css);
>> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>> return iog;
>> }
>>
>> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
>> + struct seq_file *m)
>> +{
>> + struct io_cgroup *iocg;
>> + struct policy_node *pn;
>> +
>> + iocg = cgroup_to_io_cgroup(cgrp);
>> +
>> + if (list_empty(&iocg->list))
>> + goto out;
>> +
>> + seq_printf(m, "dev weight class\n");
>> +
>> + spin_lock_irq(&iocg->lock);
>> + list_for_each_entry(pn, &iocg->list, node) {
>> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
>> + pn->weight, pn->ioprio_class);
>> + }
>> + spin_unlock_irq(&iocg->lock);
>> +out:
>> + return 0;
>> +}
>> +
>> +static inline void policy_insert_node(struct io_cgroup *iocg,
>> + struct policy_node *pn)
>> +{
>> + list_add(&pn->node, &iocg->list);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static inline void policy_delete_node(struct policy_node *pn)
>> +{
>> + list_del(&pn->node);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> + void *key)
>> +{
>> + struct policy_node *pn;
>> +
>> + if (list_empty(&iocg->list))
>> + return NULL;
>> +
>> + list_for_each_entry(pn, &iocg->list, node) {
>> + if (pn->key == key)
>> + return pn;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static void *devname_to_efqd(const char *buf)
>> +{
>> + struct block_device *bdev;
>> + void *key = NULL;
>> + struct gendisk *disk;
>> + int part;
>> +
>> + bdev = lookup_bdev(buf);
>> + if (IS_ERR(bdev))
>> + return NULL;
>> +
>> + disk = get_gendisk(bdev->bd_dev, &part);
>> + key = (void *)&disk->queue->elevator->efqd;
>> + bdput(bdev);
>> +
>> + return key;
>> +}
>> +
>> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
>> +{
>> + char *s[3];
>> + char *p;
>> + int ret;
>> + int i = 0;
>> +
>> + memset(s, 0, sizeof(s));
>> + while (i < ARRAY_SIZE(s)) {
>> + p = strsep(&buf, ":");
>> + if (!p)
>> + break;
>> + if (!*p)
>> + continue;
>> + s[i++] = p;
>> + }
>> +
>> + newpn->key = devname_to_efqd(s[0]);
>> + if (!newpn->key)
>> + return -EINVAL;
>> +
>> + strcpy(newpn->dev_name, s[0]);
>> +
>> + ret = strict_strtoul(s[1], 10, &newpn->weight);
>> + if (ret || newpn->weight > WEIGHT_MAX)
>> + return -EINVAL;
>> +
>> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
>> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
>> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
>> + const char *buffer)
>> +{
>> + struct io_cgroup *iocg;
>> + struct policy_node *newpn, *pn;
>> + char *buf;
>> + int ret = 0;
>> + int keep_newpn = 0;
>> + struct hlist_node *n;
>> + struct io_group *iog;
>> +
>> + buf = kstrdup(buffer, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
>> + if (!newpn) {
>> + ret = -ENOMEM;
>> + goto free_buf;
>> + }
>> +
>> + ret = policy_parse_and_set(buf, newpn);
>> + if (ret)
>> + goto free_newpn;
>> +
>> + if (!cgroup_lock_live_group(cgrp)) {
>> + ret = -ENODEV;
>> + goto free_newpn;
>> + }
>> +
>> + iocg = cgroup_to_io_cgroup(cgrp);
>> + spin_lock_irq(&iocg->lock);
>> +
>> + pn = policy_search_node(iocg, newpn->key);
>> + if (!pn) {
>> + if (newpn->weight != 0) {
>> + policy_insert_node(iocg, newpn);
>> + keep_newpn = 1;
>> + }
>> + goto update_io_group;
>> + }
>> +
>> + if (newpn->weight == 0) {
>> + /* weight == 0 means deleteing a policy */
>> + policy_delete_node(pn);
>> + goto update_io_group;
>> + }
>> +
>> + pn->weight = newpn->weight;
>> + pn->ioprio_class = newpn->ioprio_class;
>> +
>> +update_io_group:
>> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
>> + if (iog->key == newpn->key) {
>> + if (newpn->weight) {
>> + iog->entity.new_weight = newpn->weight;
>> + iog->entity.new_ioprio_class =
>> + newpn->ioprio_class;
>> + /*
>> + * iog weight and ioprio_class updating
>> + * actually happens if ioprio_changed is set.
>> + * So ensure ioprio_changed is not set until
>> + * new weight and new ioprio_class are updated.
>> + */
>> + smp_wmb();
>> + iog->entity.ioprio_changed = 1;
>> + } else {
>> + iog->entity.new_weight = iocg->weight;
>> + iog->entity.new_ioprio_class =
>> + iocg->ioprio_class;
>> +
>> + /* The same as above */
>> + smp_wmb();
>> + iog->entity.ioprio_changed = 1;
>> + }
>> + }
>> + }
>> + spin_unlock_irq(&iocg->lock);
>> +
>> + cgroup_unlock();
>> +
>> +free_newpn:
>> + if (!keep_newpn)
>> + kfree(newpn);
>> +free_buf:
>> + kfree(buf);
>> + return ret;
>> +}
>> +
>> struct cftype bfqio_files[] = {
>> {
>> + .name = "policy",
>> + .read_seq_string = io_cgroup_policy_read,
>> + .write_string = io_cgroup_policy_write,
>> + .max_write_len = 256,
>> + },
>> + {
>> .name = "weight",
>> .read_u64 = io_cgroup_weight_read,
>> .write_u64 = io_cgroup_weight_write,
>> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>> INIT_HLIST_HEAD(&iocg->group_data);
>> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>> + INIT_LIST_HEAD(&iocg->list);
>>
>> return &iocg->css;
>> }
>> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> unsigned long flags, flags1;
>> int queue_lock_held = 0;
>> struct elv_fq_data *efqd;
>> + struct policy_node *pn, *pntmp;
>>
>> /*
>> * io groups are linked in two lists. One list is maintained
>> @@ -1823,6 +2046,12 @@ locked:
>> BUG_ON(!hlist_empty(&iocg->group_data));
>>
>> free_css_id(&io_subsys, &iocg->css);
>> +
>> + list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
>> + policy_delete_node(pn);
>> + kfree(pn);
>> + }
>> +
>> kfree(iocg);
>> }
>>
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>> void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>> {
>> entity->ioprio = entity->new_ioprio;
>> - entity->weight = entity->new_weight;
>> + entity->weight = entity->new_weigh;
>> entity->ioprio_class = entity->new_ioprio_class;
>> entity->sched_data = &iog->sched_data;
>> }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>> #endif
>> };
>>
>> +struct policy_node {
>> + struct list_head node;
>> + char dev_name[32];
>> + void *key;
>> + unsigned long weight;
>> + unsigned long ioprio_class;
>> +};
>> +
>> /**
>> * struct bfqio_cgroup - bfq cgroup data structure.
>> * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>
>> unsigned long weight, ioprio_class;
>>
>> + /* list of policy_node */
>> + struct list_head list;
>> +
>> spinlock_t lock;
>> struct hlist_head group_data;
>> };
>> --
>> 1.5.4.rc3
>>
>>
>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
[not found] ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14 1:35 ` Gui Jianfeng
@ 2009-05-14 7:26 ` Gui Jianfeng
1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 7:26 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
Hi Vivek,
This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as default values in this device.
You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.
Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2
Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Changelog (v1 -> v2)
- Rename some structures.
- Use the spin_lock_irqsave() and spin_unlock_irqrestore() variants to avoid
enabling interrupts unconditionally.
- Fix the policy setup bug seen when switching to another io scheduler.
- If a policy is available for a specific device, don't update its weight and
io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing the policy string (a short usage sketch follows this
changelog).
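As a rough shell sketch of the elevator-switch and parsing fixes above
(assuming the io controller is mounted at /cgroup with a group grp1; the
paths, scheduler name and exact error are illustrative, not output from this
patch):

  cd /cgroup/grp1
  echo "/dev/sdb:500:2" > io.policy
  echo cfq > /sys/block/sdb/queue/scheduler    # switch the elevator
  echo "/dev/sdb:500:2" > io.policy            # no duplicate rule is added
  echo "/dev/sdb:0:" > io.policy               # malformed: rejected with EINVAL
  echo "/dev/sdb:0:2" > io.policy              # rule removed even after the switch
  cat io.policy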
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/elevator-fq.c | 258 +++++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 12 +++
2 files changed, 261 insertions(+), 9 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..43b30a4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+ dev_t dev)
{
struct io_entity *entity = &iog->entity;
+ struct io_policy_node *pn;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iocg->lock, flags);
+ pn = policy_search_node(iocg, dev);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->new_weight = pn->weight;
+ entity->ioprio_class = pn->ioprio_class;
+ entity->new_ioprio_class = pn->ioprio_class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->new_weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ entity->new_ioprio_class = iocg->ioprio_class;
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);
- entity->weight = entity->new_weight = iocg->weight;
- entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
entity->ioprio_changed = 1;
entity->my_sched_data = &iog->sched_data;
}
@@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
struct io_cgroup *iocg; \
struct io_group *iog; \
struct hlist_node *n; \
+ struct io_policy_node *pn; \
\
if (val < (__MIN) || val > (__MAX)) \
return -EINVAL; \
@@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
spin_lock_irq(&iocg->lock); \
iocg->__VAR = (unsigned long)val; \
hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ pn = policy_search_node(iocg, iog->dev); \
+ if (pn) \
+ continue; \
iog->entity.new_##__VAR = (unsigned long)val; \
smp_wmb(); \
iog->entity.ioprio_changed = 1; \
@@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
* to the root has already an allocated group on @bfqd.
*/
struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
- struct cgroup *cgroup)
+ struct cgroup *cgroup, struct bio *bio)
{
struct io_cgroup *iocg;
struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
atomic_set(&iog->ref, 0);
iog->deleting = 0;
- io_group_init_entity(iocg, iog);
- iog->my_entity = &iog->entity;
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
+ if (bio) {
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ iog->dev = MKDEV(disk->major, disk->first_minor);
+ }
#endif
+ io_group_init_entity(iocg, iog, iog->dev);
+ iog->my_entity = &iog->entity;
+
blk_init_request_list(&iog->rl);
if (leaf == NULL) {
@@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
*/
struct io_group *io_find_alloc_group(struct request_queue *q,
struct cgroup *cgroup, struct elv_fq_data *efqd,
- int create)
+ int create, struct bio *bio)
{
struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
struct io_group *iog = NULL;
@@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
if (iog != NULL || !create)
return iog;
- iog = io_group_chain_alloc(q, key, cgroup);
+ iog = io_group_chain_alloc(q, key, cgroup, bio);
if (iog != NULL)
io_group_chain_link(q, key, cgroup, iog, efqd);
@@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
goto out;
}
- iog = io_find_alloc_group(q, cgroup, efqd, create);
+ iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
if (!iog) {
if (create)
iog = efqd->root_group;
@@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
return iog;
}
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *pn;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+
+ if (list_empty(&iocg->policy_list))
+ goto out;
+
+ seq_printf(m, "dev weight class\n");
+
+ spin_lock_irq(&iocg->lock);
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+ pn->weight, pn->ioprio_class);
+ }
+ spin_unlock_irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+ struct io_policy_node *pn)
+{
+ list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+ list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev)
+{
+ struct io_policy_node *pn;
+
+ if (list_empty(&iocg->policy_list))
+ return NULL;
+
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ if (pn->dev == dev)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static int devname_to_devnum(const char *buf, dev_t *dev)
+{
+ struct block_device *bdev;
+ struct gendisk *disk;
+ int part;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return -ENODEV;
+
+ disk = get_gendisk(bdev->bd_dev, &part);
+ *dev = MKDEV(disk->major, disk->first_minor);
+ bdput(bdev);
+
+ return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+ char *s[3], *p;
+ int ret;
+ int i = 0;
+
+ memset(s, 0, sizeof(s));
+ while ((p = strsep(&buf, ":")) != NULL) {
+ if (!*p)
+ continue;
+ s[i++] = p;
+ }
+
+ ret = devname_to_devnum(s[0], &newpn->dev);
+ if (ret)
+ return ret;
+
+ strcpy(newpn->dev_name, s[0]);
+
+ if (s[1] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[1], 10, &newpn->weight);
+ if (ret || newpn->weight > WEIGHT_MAX)
+ return -EINVAL;
+
+ if (s[2] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+ if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+ newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep_newpn = 0;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ buf = kstrdup(buffer, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free_buf;
+ }
+
+ ret = policy_parse_and_set(buf, newpn);
+ if (ret)
+ goto free_newpn;
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto free_newpn;
+ }
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ spin_lock_irq(&iocg->lock);
+
+ pn = policy_search_node(iocg, newpn->dev);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy_insert_node(iocg, newpn);
+ keep_newpn = 1;
+ }
+ goto update_io_group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleting a policy */
+ policy_delete_node(pn);
+ goto update_io_group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+ if (iog->dev == newpn->dev) {
+ if (newpn->weight) {
+ iog->entity.new_weight = newpn->weight;
+ iog->entity.new_ioprio_class =
+ newpn->ioprio_class;
+ /*
+ * iog weight and ioprio_class updating
+ * actually happens if ioprio_changed is set.
+ * So ensure ioprio_changed is not set until
+ * new weight and new ioprio_class are updated.
+ */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ } else {
+ iog->entity.new_weight = iocg->weight;
+ iog->entity.new_ioprio_class =
+ iocg->ioprio_class;
+
+ /* The same as above */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ }
+ }
+ }
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+free_newpn:
+ if (!keep_newpn)
+ kfree(newpn);
+free_buf:
+ kfree(buf);
+ return ret;
+}
+
struct cftype bfqio_files[] = {
{
+ .name = "policy",
+ .read_seq_string = io_cgroup_policy_read,
+ .write_string = io_cgroup_policy_write,
+ .max_write_len = 256,
+ },
+ {
.name = "weight",
.read_u64 = io_cgroup_weight_read,
.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
INIT_HLIST_HEAD(&iocg->group_data);
iocg->weight = IO_DEFAULT_GRP_WEIGHT;
iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+ INIT_LIST_HEAD(&iocg->policy_list);
return &iocg->css;
}
@@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
unsigned long flags, flags1;
int queue_lock_held = 0;
struct elv_fq_data *efqd;
+ struct io_policy_node *pn, *pntmp;
/*
* io groups are linked in two lists. One list is maintained
@@ -1823,6 +2057,12 @@ locked:
BUG_ON(!hlist_empty(&iocg->group_data));
free_css_id(&io_subsys, &iocg->css);
+
+ list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+ policy_delete_node(pn);
+ kfree(pn);
+ }
+
kfree(iocg);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..b1d97e6 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -250,9 +250,18 @@ struct io_group {
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
unsigned short iocg_id;
+ dev_t dev;
#endif
};
+struct io_policy_node {
+ struct list_head node;
+ char dev_name[32];
+ dev_t dev;
+ unsigned long weight;
+ unsigned long ioprio_class;
+};
+
/**
* struct bfqio_cgroup - bfq cgroup data structure.
* @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +278,9 @@ struct io_cgroup {
unsigned long weight, ioprio_class;
+ /* list of io_policy_node */
+ struct list_head policy_list;
+
spinlock_t lock;
struct hlist_head group_data;
};
--
1.5.4.rc3
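One way to exercise the io_cgroup_##__VAR##_write change above (groups that
already have a per-device rule are skipped when the generic files are written)
is a quick manual check along these lines; the cgroup path and devices are
only examples:

  echo "/dev/sdb:500:2" > /cgroup/grp1/io.policy
  echo 300 > /cgroup/grp1/io.weight
  cat /cgroup/grp1/io.policy
  # expectation: groups backed by /dev/sdb keep weight 500 from the rule,
  # while groups on other disks (e.g. /dev/sda) move to the new default of 300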
^ permalink raw reply related [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-13 19:09 ` Vivek Goyal
2009-05-14 1:35 ` Gui Jianfeng
[not found] ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14 7:26 ` Gui Jianfeng
2009-05-14 15:15 ` Vivek Goyal
` (2 more replies)
2 siblings, 3 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14 7:26 UTC (permalink / raw)
To: Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
Hi Vivek,
This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't configure a particular device specially, "weight" and
"ioprio_class" are used as the default values for that device.
You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/io.policy
weight=0 means removing the policy for DEV.
Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2
Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
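The same interface can also be driven programmatically. The fragment below is
only an illustrative userspace sketch, not part of the patch; the cgroup mount
point (/mnt/cgroup/test) and the device name are assumptions and have to be
adapted to the local setup.

/* Illustrative sketch: write one policy entry and read io.policy back.
 * Assumed layout: an io cgroup "test" mounted under /mnt/cgroup. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	const char *policy = "/mnt/cgroup/test/io.policy";	/* assumed path */
	const char *entry = "/dev/hdb:300:2";			/* DEV:weight:ioprio_class */
	char buf[256];
	ssize_t n;
	int fd;

	fd = open(policy, O_WRONLY);
	if (fd < 0) {
		perror("open(policy, O_WRONLY)");
		return 1;
	}
	if (write(fd, entry, strlen(entry)) < 0)
		perror("write");
	close(fd);

	fd = open(policy, O_RDONLY);
	if (fd < 0) {
		perror("open(policy, O_RDONLY)");
		return 1;
	}
	n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);	/* expect: "dev weight class" then "/dev/hdb 300 2" */
	}
	close(fd);
	return 0;
}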
Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave() and spin_unlock_irqrestore() variants to
  avoid enabling interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update its weight
  and io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/elevator-fq.c | 258 +++++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 12 +++
2 files changed, 261 insertions(+), 9 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..43b30a4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
#include "elevator-fq.h"
#include <linux/blktrace_api.h>
#include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
/* Values taken from cfq */
const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
}
EXPORT_SYMBOL(io_lookup_io_group_current);
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+ dev_t dev)
{
struct io_entity *entity = &iog->entity;
+ struct io_policy_node *pn;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iocg->lock, flags);
+ pn = policy_search_node(iocg, dev);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->new_weight = pn->weight;
+ entity->ioprio_class = pn->ioprio_class;
+ entity->new_ioprio_class = pn->ioprio_class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->new_weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ entity->new_ioprio_class = iocg->ioprio_class;
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);
- entity->weight = entity->new_weight = iocg->weight;
- entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
entity->ioprio_changed = 1;
entity->my_sched_data = &iog->sched_data;
}
@@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
struct io_cgroup *iocg; \
struct io_group *iog; \
struct hlist_node *n; \
+ struct io_policy_node *pn; \
\
if (val < (__MIN) || val > (__MAX)) \
return -EINVAL; \
@@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
spin_lock_irq(&iocg->lock); \
iocg->__VAR = (unsigned long)val; \
hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ pn = policy_search_node(iocg, iog->dev); \
+ if (pn) \
+ continue; \
iog->entity.new_##__VAR = (unsigned long)val; \
smp_wmb(); \
iog->entity.ioprio_changed = 1; \
@@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
* to the root has already an allocated group on @bfqd.
*/
struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
- struct cgroup *cgroup)
+ struct cgroup *cgroup, struct bio *bio)
{
struct io_cgroup *iocg;
struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
atomic_set(&iog->ref, 0);
iog->deleting = 0;
- io_group_init_entity(iocg, iog);
- iog->my_entity = &iog->entity;
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
iog->iocg_id = css_id(&iocg->css);
+ if (bio) {
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ iog->dev = MKDEV(disk->major, disk->first_minor);
+ }
#endif
+ io_group_init_entity(iocg, iog, iog->dev);
+ iog->my_entity = &iog->entity;
+
blk_init_request_list(&iog->rl);
if (leaf == NULL) {
@@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
*/
struct io_group *io_find_alloc_group(struct request_queue *q,
struct cgroup *cgroup, struct elv_fq_data *efqd,
- int create)
+ int create, struct bio *bio)
{
struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
struct io_group *iog = NULL;
@@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
if (iog != NULL || !create)
return iog;
- iog = io_group_chain_alloc(q, key, cgroup);
+ iog = io_group_chain_alloc(q, key, cgroup, bio);
if (iog != NULL)
io_group_chain_link(q, key, cgroup, iog, efqd);
@@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
goto out;
}
- iog = io_find_alloc_group(q, cgroup, efqd, create);
+ iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
if (!iog) {
if (create)
iog = efqd->root_group;
@@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
return iog;
}
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *pn;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+
+ if (list_empty(&iocg->policy_list))
+ goto out;
+
+ seq_printf(m, "dev weight class\n");
+
+ spin_lock_irq(&iocg->lock);
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+ pn->weight, pn->ioprio_class);
+ }
+ spin_unlock_irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+ struct io_policy_node *pn)
+{
+ list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+ list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev)
+{
+ struct io_policy_node *pn;
+
+ if (list_empty(&iocg->policy_list))
+ return NULL;
+
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ if (pn->dev == dev)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static int devname_to_devnum(const char *buf, dev_t *dev)
+{
+ struct block_device *bdev;
+ struct gendisk *disk;
+ int part;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return -ENODEV;
+
+ disk = get_gendisk(bdev->bd_dev, &part);
+ *dev = MKDEV(disk->major, disk->first_minor);
+ bdput(bdev);
+
+ return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+ char *s[3], *p;
+ int ret;
+ int i = 0;
+
+ memset(s, 0, sizeof(s));
+ while ((p = strsep(&buf, ":")) != NULL) {
+ if (!*p)
+ continue;
+ s[i++] = p;
+ }
+
+ ret = devname_to_devnum(s[0], &newpn->dev);
+ if (ret)
+ return ret;
+
+ strcpy(newpn->dev_name, s[0]);
+
+ if (s[1] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[1], 10, &newpn->weight);
+ if (ret || newpn->weight > WEIGHT_MAX)
+ return -EINVAL;
+
+ if (s[2] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+ if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+ newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep_newpn = 0;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ buf = kstrdup(buffer, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free_buf;
+ }
+
+ ret = policy_parse_and_set(buf, newpn);
+ if (ret)
+ goto free_newpn;
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto free_newpn;
+ }
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ spin_lock_irq(&iocg->lock);
+
+ pn = policy_search_node(iocg, newpn->dev);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy_insert_node(iocg, newpn);
+ keep_newpn = 1;
+ }
+ goto update_io_group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleting a policy */
+ policy_delete_node(pn);
+ goto update_io_group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+ if (iog->dev == newpn->dev) {
+ if (newpn->weight) {
+ iog->entity.new_weight = newpn->weight;
+ iog->entity.new_ioprio_class =
+ newpn->ioprio_class;
+ /*
+ * iog weight and ioprio_class updating
+ * actually happens if ioprio_changed is set.
+ * So ensure ioprio_changed is not set until
+ * new weight and new ioprio_class are updated.
+ */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ } else {
+ iog->entity.new_weight = iocg->weight;
+ iog->entity.new_ioprio_class =
+ iocg->ioprio_class;
+
+ /* The same as above */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ }
+ }
+ }
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+free_newpn:
+ if (!keep_newpn)
+ kfree(newpn);
+free_buf:
+ kfree(buf);
+ return ret;
+}
+
struct cftype bfqio_files[] = {
{
+ .name = "policy",
+ .read_seq_string = io_cgroup_policy_read,
+ .write_string = io_cgroup_policy_write,
+ .max_write_len = 256,
+ },
+ {
.name = "weight",
.read_u64 = io_cgroup_weight_read,
.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
INIT_HLIST_HEAD(&iocg->group_data);
iocg->weight = IO_DEFAULT_GRP_WEIGHT;
iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+ INIT_LIST_HEAD(&iocg->policy_list);
return &iocg->css;
}
@@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
unsigned long flags, flags1;
int queue_lock_held = 0;
struct elv_fq_data *efqd;
+ struct io_policy_node *pn, *pntmp;
/*
* io groups are linked in two lists. One list is maintained
@@ -1823,6 +2057,12 @@ locked:
BUG_ON(!hlist_empty(&iocg->group_data));
free_css_id(&io_subsys, &iocg->css);
+
+ list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+ policy_delete_node(pn);
+ kfree(pn);
+ }
+
kfree(iocg);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..b1d97e6 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -250,9 +250,18 @@ struct io_group {
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
unsigned short iocg_id;
+ dev_t dev;
#endif
};
+struct io_policy_node {
+ struct list_head node;
+ char dev_name[32];
+ dev_t dev;
+ unsigned long weight;
+ unsigned long ioprio_class;
+};
+
/**
* struct bfqio_cgroup - bfq cgroup data structure.
* @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +278,9 @@ struct io_cgroup {
unsigned long weight, ioprio_class;
+ /* list of io_policy_node */
+ struct list_head policy_list;
+
spinlock_t lock;
struct hlist_head group_data;
};
--
1.5.4.rc3
^ permalink raw reply related [flat|nested] 297+ messages in thread
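A note on the smp_wmb()/ioprio_changed pattern used in io_cgroup_policy_write()
above: the new weight and class are stored first, the barrier orders those
stores, and only then is the changed flag raised, so whoever observes the flag
also sees the new values. The consumer side lives elsewhere in elevator-fq and
is not shown in this patch, so the fragment below is only a self-contained
userspace analogue of that pairing, with C11 release/acquire atomics standing
in for the kernel's smp_wmb()/smp_rmb().

/* Userspace analogue (illustrative only) of the publish pattern above:
 * the release store of the flag pairs with an acquire load on the
 * consumer, so the consumer never sees the flag without the values. */
#include <stdatomic.h>
#include <stdio.h>

struct entity {
	unsigned long new_weight;
	unsigned long new_ioprio_class;
	atomic_int ioprio_changed;
};

static struct entity e;		/* zero-initialized */

static void publish(unsigned long weight, unsigned long ioprio_class)
{
	e.new_weight = weight;
	e.new_ioprio_class = ioprio_class;
	/* corresponds to smp_wmb(); ioprio_changed = 1; in the patch */
	atomic_store_explicit(&e.ioprio_changed, 1, memory_order_release);
}

static int consume(unsigned long *weight, unsigned long *ioprio_class)
{
	if (!atomic_load_explicit(&e.ioprio_changed, memory_order_acquire))
		return 0;	/* nothing published yet */
	*weight = e.new_weight;
	*ioprio_class = e.new_ioprio_class;
	return 1;
}

int main(void)
{
	unsigned long w, cls;

	publish(300, 2);
	if (consume(&w, &cls))
		printf("weight=%lu class=%lu\n", w, cls);
	return 0;
}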
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-14 7:26 ` Gui Jianfeng
@ 2009-05-14 15:15 ` Vivek Goyal
2009-05-18 22:33 ` IKEDA, Munehiro
[not found] ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-14 15:15 UTC (permalink / raw)
To: Gui Jianfeng
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, m-ikeda, akpm
On Thu, May 14, 2009 at 03:26:35PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
>
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
>
> Changelog (v1 -> v2)
> - Rename some structures
> - Use spin_lock_irqsave() and spin_lock_irqrestore() version to prevent
> from enabling the interrupts unconditionally.
> - Fix policy setup bug when switching to another io scheduler.
> - If a policy is available for a specific device, don't update weight and
> io class when writing "weight" and "iprio_class".
> - Fix a bug when parsing policy string.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
Thanks a lot, Gui. This patch seems to be working fine for me now. I will
continue to do more testing and let you know if there are more issues. I
will include it in the next posting (V3).
Thanks
Vivek
> block/elevator-fq.c | 258 +++++++++++++++++++++++++++++++++++++++++++++++++--
> block/elevator-fq.h | 12 +++
> 2 files changed, 261 insertions(+), 9 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..43b30a4 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
> #include "elevator-fq.h"
> #include <linux/blktrace_api.h>
> #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>
> /* Values taken from cfq */
> const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
> + dev_t dev);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> + dev_t dev)
> {
> struct io_entity *entity = &iog->entity;
> + struct io_policy_node *pn;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&iocg->lock, flags);
> + pn = policy_search_node(iocg, dev);
> + if (pn) {
> + entity->weight = pn->weight;
> + entity->new_weight = pn->weight;
> + entity->ioprio_class = pn->ioprio_class;
> + entity->new_ioprio_class = pn->ioprio_class;
> + } else {
> + entity->weight = iocg->weight;
> + entity->new_weight = iocg->weight;
> + entity->ioprio_class = iocg->ioprio_class;
> + entity->new_ioprio_class = iocg->ioprio_class;
> + }
> + spin_unlock_irqrestore(&iocg->lock, flags);
>
> - entity->weight = entity->new_weight = iocg->weight;
> - entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
> entity->ioprio_changed = 1;
> entity->my_sched_data = &iog->sched_data;
> }
> @@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> struct io_cgroup *iocg; \
> struct io_group *iog; \
> struct hlist_node *n; \
> + struct io_policy_node *pn; \
> \
> if (val < (__MIN) || val > (__MAX)) \
> return -EINVAL; \
> @@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
> spin_lock_irq(&iocg->lock); \
> iocg->__VAR = (unsigned long)val; \
> hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
> + pn = policy_search_node(iocg, iog->dev); \
> + if (pn) \
> + continue; \
> iog->entity.new_##__VAR = (unsigned long)val; \
> smp_wmb(); \
> iog->entity.ioprio_changed = 1; \
> @@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
> * to the root has already an allocated group on @bfqd.
> */
> struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> - struct cgroup *cgroup)
> + struct cgroup *cgroup, struct bio *bio)
> {
> struct io_cgroup *iocg;
> struct io_group *iog, *leaf = NULL, *prev = NULL;
> @@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> atomic_set(&iog->ref, 0);
> iog->deleting = 0;
>
> - io_group_init_entity(iocg, iog);
> - iog->my_entity = &iog->entity;
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> iog->iocg_id = css_id(&iocg->css);
> + if (bio) {
> + struct gendisk *disk = bio->bi_bdev->bd_disk;
> + iog->dev = MKDEV(disk->major, disk->first_minor);
> + }
> #endif
>
> + io_group_init_entity(iocg, iog, iog->dev);
> + iog->my_entity = &iog->entity;
> +
> blk_init_request_list(&iog->rl);
>
> if (leaf == NULL) {
> @@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
> */
> struct io_group *io_find_alloc_group(struct request_queue *q,
> struct cgroup *cgroup, struct elv_fq_data *efqd,
> - int create)
> + int create, struct bio *bio)
> {
> struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> struct io_group *iog = NULL;
> @@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
> if (iog != NULL || !create)
> return iog;
>
> - iog = io_group_chain_alloc(q, key, cgroup);
> + iog = io_group_chain_alloc(q, key, cgroup, bio);
> if (iog != NULL)
> io_group_chain_link(q, key, cgroup, iog, efqd);
>
> @@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> goto out;
> }
>
> - iog = io_find_alloc_group(q, cgroup, efqd, create);
> + iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
> if (!iog) {
> if (create)
> iog = efqd->root_group;
> @@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> return iog;
> }
>
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct io_cgroup *iocg;
> + struct io_policy_node *pn;
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> +
> + if (list_empty(&iocg->policy_list))
> + goto out;
> +
> + seq_printf(m, "dev weight class\n");
> +
> + spin_lock_irq(&iocg->lock);
> + list_for_each_entry(pn, &iocg->policy_list, node) {
> + seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> + pn->weight, pn->ioprio_class);
> + }
> + spin_unlock_irq(&iocg->lock);
> +out:
> + return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> + struct io_policy_node *pn)
> +{
> + list_add(&pn->node, &iocg->policy_list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct io_policy_node *pn)
> +{
> + list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
> + dev_t dev)
> +{
> + struct io_policy_node *pn;
> +
> + if (list_empty(&iocg->policy_list))
> + return NULL;
> +
> + list_for_each_entry(pn, &iocg->policy_list, node) {
> + if (pn->dev == dev)
> + return pn;
> + }
> +
> + return NULL;
> +}
> +
> +static int devname_to_devnum(const char *buf, dev_t *dev)
> +{
> + struct block_device *bdev;
> + struct gendisk *disk;
> + int part;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return -ENODEV;
> +
> + disk = get_gendisk(bdev->bd_dev, &part);
> + *dev = MKDEV(disk->major, disk->first_minor);
> + bdput(bdev);
> +
> + return 0;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
> +{
> + char *s[3], *p;
> + int ret;
> + int i = 0;
> +
> + memset(s, 0, sizeof(s));
> + while ((p = strsep(&buf, ":")) != NULL) {
> + if (!*p)
> + continue;
> + s[i++] = p;
> + }
> +
> + ret = devname_to_devnum(s[0], &newpn->dev);
> + if (ret)
> + return ret;
> +
> + strcpy(newpn->dev_name, s[0]);
> +
> + if (s[1] == NULL)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[1], 10, &newpn->weight);
> + if (ret || newpn->weight > WEIGHT_MAX)
> + return -EINVAL;
> +
> + if (s[2] == NULL)
> + return -EINVAL;
> +
> + ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> + if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> + newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct io_cgroup *iocg;
> + struct io_policy_node *newpn, *pn;
> + char *buf;
> + int ret = 0;
> + int keep_newpn = 0;
> + struct hlist_node *n;
> + struct io_group *iog;
> +
> + buf = kstrdup(buffer, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> + if (!newpn) {
> + ret = -ENOMEM;
> + goto free_buf;
> + }
> +
> + ret = policy_parse_and_set(buf, newpn);
> + if (ret)
> + goto free_newpn;
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto free_newpn;
> + }
> +
> + iocg = cgroup_to_io_cgroup(cgrp);
> + spin_lock_irq(&iocg->lock);
> +
> + pn = policy_search_node(iocg, newpn->dev);
> + if (!pn) {
> + if (newpn->weight != 0) {
> + policy_insert_node(iocg, newpn);
> + keep_newpn = 1;
> + }
> + goto update_io_group;
> + }
> +
> + if (newpn->weight == 0) {
> + /* weight == 0 means deleteing a policy */
> + policy_delete_node(pn);
> + goto update_io_group;
> + }
> +
> + pn->weight = newpn->weight;
> + pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> + hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> + if (iog->dev == newpn->dev) {
> + if (newpn->weight) {
> + iog->entity.new_weight = newpn->weight;
> + iog->entity.new_ioprio_class =
> + newpn->ioprio_class;
> + /*
> + * iog weight and ioprio_class updating
> + * actually happens if ioprio_changed is set.
> + * So ensure ioprio_changed is not set until
> + * new weight and new ioprio_class are updated.
> + */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + } else {
> + iog->entity.new_weight = iocg->weight;
> + iog->entity.new_ioprio_class =
> + iocg->ioprio_class;
> +
> + /* The same as above */
> + smp_wmb();
> + iog->entity.ioprio_changed = 1;
> + }
> + }
> + }
> + spin_unlock_irq(&iocg->lock);
> +
> + cgroup_unlock();
> +
> +free_newpn:
> + if (!keep_newpn)
> + kfree(newpn);
> +free_buf:
> + kfree(buf);
> + return ret;
> +}
> +
> struct cftype bfqio_files[] = {
> {
> + .name = "policy",
> + .read_seq_string = io_cgroup_policy_read,
> + .write_string = io_cgroup_policy_write,
> + .max_write_len = 256,
> + },
> + {
> .name = "weight",
> .read_u64 = io_cgroup_weight_read,
> .write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
> INIT_HLIST_HEAD(&iocg->group_data);
> iocg->weight = IO_DEFAULT_GRP_WEIGHT;
> iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> + INIT_LIST_HEAD(&iocg->policy_list);
>
> return &iocg->css;
> }
> @@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> unsigned long flags, flags1;
> int queue_lock_held = 0;
> struct elv_fq_data *efqd;
> + struct io_policy_node *pn, *pntmp;
>
> /*
> * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2057,12 @@ locked:
> BUG_ON(!hlist_empty(&iocg->group_data));
>
> free_css_id(&io_subsys, &iocg->css);
> +
> + list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
> + policy_delete_node(pn);
> + kfree(pn);
> + }
> +
> kfree(iocg);
> }
>
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..b1d97e6 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -250,9 +250,18 @@ struct io_group {
>
> #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> unsigned short iocg_id;
> + dev_t dev;
> #endif
> };
>
> +struct io_policy_node {
> + struct list_head node;
> + char dev_name[32];
> + dev_t dev;
> + unsigned long weight;
> + unsigned long ioprio_class;
> +};
> +
> /**
> * struct bfqio_cgroup - bfq cgroup data structure.
> * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +278,9 @@ struct io_cgroup {
>
> unsigned long weight, ioprio_class;
>
> + /* list of io_policy_node */
> + struct list_head policy_list;
> +
> spinlock_t lock;
> struct hlist_head group_data;
> };
> --
> 1.5.4.rc3
>
>
^ permalink raw reply [flat|nested] 297+ messages in thread
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-14 7:26 ` Gui Jianfeng
2009-05-14 15:15 ` Vivek Goyal
@ 2009-05-18 22:33 ` IKEDA, Munehiro
2009-05-20 1:44 ` Gui Jianfeng
[not found] ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
[not found] ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2 siblings, 2 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-18 22:33 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Vivek Goyal, nauman, dpshah, lizf, mikew, fchecconi,
paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
agk, dm-devel, snitzer, akpm
Hi Gui,
Gui Jianfeng wrote:
> Hi Vivek,
>
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as default values in this device.
>
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
> weight=0 means removing the policy for DEV.
>
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
Users can specify the device file of a partition for io.policy.
In this case, io_policy_node::dev_name is set to the name of the
partition device, e.g. /dev/sda2.
ex)
# cd /mnt/cgroup
# echo /dev/sda2:500:2 > io.policy
# cat io.policy
dev weight class
/dev/sda2 500 2
I believe io_policy_node::dev_name should be set to the whole-disk
device name, e.g. /dev/sda.
What do you think about it?
Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
block/elevator-fq.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39fa2a1..5d3d55c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1631,11 +1631,12 @@ static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
return NULL;
}
-static int devname_to_devnum(const char *buf, dev_t *dev)
+static int devname_to_devnum(char *buf, dev_t *dev)
{
struct block_device *bdev;
struct gendisk *disk;
int part;
+ char *c;
bdev = lookup_bdev(buf);
if (IS_ERR(bdev))
@@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
*dev = MKDEV(disk->major, disk->first_minor);
bdput(bdev);
+ c = strrchr(buf, '/');
+ if (c)
+ strcpy(c+1, disk->disk_name);
+
return 0;
}
--
1.5.4.3
--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda@ds.jp.nec.com
^ permalink raw reply related [flat|nested] 297+ messages in thread
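The whole-disk vs. partition question above comes down to which dev_t a given
device node resolves to. The snippet below is a purely illustrative userspace
check (it is unrelated to the kernel-side lookup_bdev()/get_gendisk() path in
the patch): a partition node reports its own minor number, which is why the
policy code has to map it back to, or reject anything but, the whole-disk
device.

/* Illustrative only: print major:minor of a block device node. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /dev/<disk-or-partition>\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &st) < 0) {
		perror("stat");
		return 1;
	}
	if (!S_ISBLK(st.st_mode)) {
		fprintf(stderr, "%s is not a block device\n", argv[1]);
		return 1;
	}
	/* e.g. /dev/sda -> 8:0 but /dev/sda2 -> 8:2 on a typical setup */
	printf("%s -> %u:%u\n", argv[1], major(st.st_rdev), minor(st.st_rdev));
	return 0;
}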
* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
2009-05-18 22:33 ` IKEDA, Munehiro
@ 2009-05-20 1:44 ` Gui Jianfeng
[not found] ` <4A136090.5090705-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
[not found] ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
1 sibling, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-20 1:44 UTC (permalink / raw)
To: IKEDA, Munehiro, Vivek Goyal
Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
snitzer, akpm
IKEDA, Munehiro wrote:
> Hi Gui,
>
> Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class
>> handling.
>> A new cgroup interface "policy" is introduced. You can make use of
>> this file to configure weight and ioprio_class for each device in a
>> given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If
>> you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>
> Users can specify a device file of a partition for io.policy.
> In this case, io_policy_node::dev_name is set as a name of the
> partition device like /dev/sda2.
>
> ex)
> # cd /mnt/cgroup
> # cat /dev/sda2:500:2 > io.policy
> # echo io.policy
> dev weight class
> /dev/sda2 500 2
>
> I believe io_policy_node::dev_name should be set a generic
> device name like /dev/sda.
> What do you think about it?
Hi Ikeda-san,
Sorry for the late reply, and thanks for pointing this out.
Yes, it does the right thing but shows a wrong name.
IMHO, specifying a single partition should not be allowed since the
policy is applied on a per-disk basis. So how about the following patch?
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1a0ca07..b620768 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1650,6 +1650,9 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
return -ENODEV;
disk = get_gendisk(bdev->bd_dev, &part);
+ if (part)
+ return -EINVAL;
+
*dev = MKDEV(disk->major, disk->first_minor);
bdput(bdev);
>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
> block/elevator-fq.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 39fa2a1..5d3d55c 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1631,11 +1631,12 @@ static struct io_policy_node
> *policy_search_node(const struct io_cgroup *iocg,
> return NULL;
> }
>
> -static int devname_to_devnum(const char *buf, dev_t *dev)
> +static int devname_to_devnum(char *buf, dev_t *dev)
> {
> struct block_device *bdev;
> struct gendisk *disk;
> int part;
> + char *c;
>
> bdev = lookup_bdev(buf);
> if (IS_ERR(bdev))
> @@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf,
> dev_t *dev)
> *dev = MKDEV(disk->major, disk->first_minor);
> bdput(bdev);
>
> + c = strrchr(buf, '/');
> + if (c)
> + strcpy(c+1, disk->disk_name);
> +
> return 0;
> }
>
--
Regards
Gui Jianfeng
^ permalink raw reply related [flat|nested] 297+ messages in thread