* [PATCH 0/9] cgroup: io-throttle controller (v13)
@ 2009-04-14 20:21 Andrea Righi
0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2009-04-14 20:21 UTC (permalink / raw)
To: Paul Menage
Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Carl Henrik Lunde,
eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w, Balbir Singh,
fernando-gVGce1chcLdL9jVzuh4AOg, dradford-cT2on/YLNlBWk0Htik3J/w,
agk-9JcytcrH/bA+uJoB2kUjGw,
subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
axboe-tSWWG44O7X1aa/9Udqfwiw,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
ngupta-hpIqsD4AKlfQT0dZR+AlfA
Objective
~~~~~~~~~
The objective of the io-throttle controller is to improve IO performance
predictability of different cgroups that share the same block devices.
State of the art (quick overview)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recent work by Vivek proposes a weighted bandwidth solution, introducing
fair queuing support in the elevator layer and modifying the existing IO
schedulers to use that functionality
(https://lists.linux-foundation.org/pipermail/containers/2009-March/016129.html).
For the fair queuing part Vivek's IO controller makes use of the BFQ
code as posted by Paolo and Fabio (http://lkml.org/lkml/2008/11/11/148).
The dm-ioband controller by the valinux guys is also proposing a
proportional ticket-based solution fully implemented at the device
mapper level (http://people.valinux.co.jp/~ryov/dm-ioband/).
The bio-cgroup patch (http://people.valinux.co.jp/~ryov/bio-cgroup/) is
a BIO tracking mechanism for cgroups, implemented in the cgroup memory
subsystem. It is maintained by Ryo and it allows dm-ioband to track
writeback requests issued by kernel threads (pdflush).
Another work by Satoshi implements cgroup awareness in CFQ, mapping
per-cgroup priorities to CFQ IO priorities; this also provides only
proportional BW support (http://lwn.net/Articles/306772/).
Please correct me or integrate if I missed someone or something. :)
Proposed solution
~~~~~~~~~~~~~~~~~
Compared to other priority/weight-based solutions, the approach used by
this controller is to explicitly choke applications' requests that
directly or indirectly generate IO activity in the system (this
controller addresses both synchronous IO and writeback/buffered IO).
The bandwidth and iops limiting method has the advantage of improving
performance predictability, at the cost of reducing, in general, the
overall throughput of the system.
IO throttling and accounting is performed during the submission of IO
requests and it is independent of the particular IO scheduler.
Detailed information about the design, goals and usage is provided in the
documentation (see [PATCH 1/9]).
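Just to give an idea of the principle (this is only a rough sketch with
made-up names, not the actual code of the patchset), the accounting reduces
to a per-cgroup, per-device byte budget that grows over time; when a cgroup
exceeds its budget the excess is converted into a sleep time for the
submitter:

struct iothrottle_node {
        spinlock_t lock;
        u64 bw_limit;           /* allowed bytes per second on this device */
        u64 charged;            /* bytes accounted in the current window */
        unsigned long start;    /* jiffies when the current window began */
};

/* charge "bytes" and return how many jiffies the submitter should sleep */
static unsigned long iothrottle_charge(struct iothrottle_node *n, size_t bytes)
{
        unsigned long elapsed, sleep = 0;
        u64 allowed;

        if (!n->bw_limit)
                return 0;       /* no limit configured for this device */

        spin_lock(&n->lock);
        elapsed = jiffies - n->start + 1;
        n->charged += bytes;
        /* bytes the cgroup was allowed to submit since the window began */
        allowed = div64_u64(n->bw_limit * elapsed, HZ);
        if (n->charged > allowed)
                /* convert the excess back into a delay, in jiffies */
                sleep = div64_u64((n->charged - allowed) * HZ, n->bw_limit);
        spin_unlock(&n->lock);
        return sleep;
}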
Implementation
~~~~~~~~~~~~~~
Patchset against latest Linus' git:
[PATCH 0/9] cgroup: block device IO controller (v13)
[PATCH 1/9] io-throttle documentation
[PATCH 2/9] res_counter: introduce ratelimiting attributes
[PATCH 3/9] bio-cgroup controller
[PATCH 4/9] support checking of cgroup subsystem dependencies
[PATCH 5/9] io-throttle controller infrastructure
[PATCH 6/9] kiothrottled: throttle buffered (writeback) IO
[PATCH 7/9] io-throttle instrumentation
[PATCH 8/9] export per-task io-throttle statistics to userspace
[PATCH 9/9] ext3: do not throttle metadata and journal IO
The v13 all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
There are some significant changes in this patchset with respect to the
previous version.
Thanks to Gui Jianfeng's contribution the io-throttle controller now
uses bio-cgroup to track buffered (writeback) IO, instead of the memory
cgroup controller, and it is also possible to mount memcg, bio-cgroup
and io-throttle at different mount points (see also
http://lwn.net/Articles/308108/).
Moreover, a kernel thread (kiothrottled) has been introduced to schedule
throttled writeback requests asynchronously. This allows smoothing the
bursty IO generated by bunches of pdflush writeback requests. All
those requests are added to an rbtree and dispatched asynchronously by
kiothrottled using a deadline-based policy.
The kiothrottled scheduler can be improved in future versions to
implement proportional/weighted IO scheduling, preferably with
feedback from the existing IO schedulers.
Experimental results
~~~~~~~~~~~~~~~~~~~~
Below are a few quick experimental results with writeback IO. Results with
synchronous IO (read and write) are more or less the same as those obtained
with the previous io-throttle version.
Two cgroups:
cgroup-a: 4MB/s BW limit on /dev/sda
cgroup-b: 2MB/s BW limit on /dev/sda
Run 2 concurrent "dd"s (1 in cgroup-a, 1 in cgroup-b) to simulate a
large write stream and generate many writeback IO requests.
Expected results: 6MB/s from the disk's point of view, 4MB/s and 2MB/s
from the application's point of view.
Experimental results:
* From the disk's point of view (dstat -d -D sda1):
   with kiothrottled        without kiothrottled
     --dsk/sda1-                --dsk/sda1-
      read  writ                 read  writ
         0  6252k                   0  9688k
         0  6904k                   0  6488k
         0  6320k                   0  2320k
         0  6144k                   0  8192k
         0  6220k                   0    10M
         0  6212k                   0  5208k
         0  6228k                   0  1940k
         0  6212k                   0  1300k
         0  6312k                   0  8100k
         0  6216k                   0  8640k
         0  6228k                   0  6584k
         0  6648k                   0  2440k
          ...                        ...
        -----                      -----
   avg: 6325k                  avg: 5928k
* From the application's point of view:
- with kiothrottled -
cgroup-a)
$ dd if=/dev/zero of=4m-bw.out bs=1M
196+0 records in
196+0 records out
205520896 bytes (206 MB) copied, 40.762 s, 5.0 MB/s
cgroup-b)
$ dd if=/dev/zero of=2m-bw.out bs=1M
97+0 records in
97+0 records out
101711872 bytes (102 MB) copied, 37.3826 s, 2.7 MB/s
- without kiothrottled -
cgroup-a)
$ dd if=/dev/zero of=4m-bw.out bs=1M
133+0 records in
133+0 records out
139460608 bytes (139 MB) copied, 39.1345 s, 3.6 MB/s
cgroup-b)
$ dd if=/dev/zero of=2m-bw.out bs=1M
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 39.0422 s, 1.9 MB/s
Changelog (v12 -> v13)
~~~~~~~~~~~~~~~~~~~~~~
* rewritten on top of bio-cgroup to track writeback IO
* now it is possible to mount memory, bio-cgroup and io-throttle cgroups in
different mount points
* introduce a dedicated kernel thread (kiothrottled) to throttle writeback IO
* updated documentation
-Andrea
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/9] cgroup: io-throttle controller (v13)
2009-04-14 20:21 Andrea Righi
@ 2009-04-16 22:24 ` Andrew Morton
2009-04-30 13:20 ` Alan D. Brunelle
1 sibling, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2009-04-16 22:24 UTC (permalink / raw)
To: Andrea Righi
Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
menage-hpIqsD4AKlfQT0dZR+AlfA, chlunde-om2ZC0WAoZIXWF+eFR7m5Q,
eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
fernando-gVGce1chcLdL9jVzuh4AOg, dradford-cT2on/YLNlBWk0Htik3J/w,
agk-9JcytcrH/bA+uJoB2kUjGw,
subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
axboe-tSWWG44O7X1aa/9Udqfwiw,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
ngupta-hpIqsD4AKlfQT0dZR+AlfA
On Tue, 14 Apr 2009 22:21:11 +0200
Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Objective
> ~~~~~~~~~
> The objective of the io-throttle controller is to improve IO performance
> predictability of different cgroups that share the same block devices.
We should get an IO controller into Linux. Does anyone have a reason
why it shouldn't be this one?
> Respect to other priority/weight-based solutions the approach used by
> this controller is to explicitly choke applications' requests
Yes, blocking the offending application at a high level has always
seemed to me to be the best way of implementing the controller.
> that
> directly or indirectly generate IO activity in the system (this
> controller addresses both synchronous IO and writeback/buffered IO).
The problem I've seen with some of the proposed controllers was that
they didn't handle delayed writeback very well, if at all.
Can you explain at a high level but in some detail how this works? If
an application is doing a huge write(), how is that detected and how is
the application made to throttle?
Does it add new metadata to `struct page' for this?
I assume that the write throttling is also wired up into the MAP_SHARED
write-fault path?
Does this patchset provide a path by which we can implement IO control
for (say) NFS mounts?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/9] cgroup: io-throttle controller (v13)
2009-04-16 22:24 ` Andrew Morton
@ 2009-04-17 9:37 ` Andrea Righi
-1 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2009-04-17 9:37 UTC (permalink / raw)
To: Andrew Morton
Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
menage-hpIqsD4AKlfQT0dZR+AlfA, chlunde-om2ZC0WAoZIXWF+eFR7m5Q,
eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
fernando-gVGce1chcLdL9jVzuh4AOg, dradford-cT2on/YLNlBWk0Htik3J/w,
agk-9JcytcrH/bA+uJoB2kUjGw,
subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
axboe-tSWWG44O7X1aa/9Udqfwiw,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
ngupta-hpIqsD4AKlfQT0dZR+AlfA
On Thu, Apr 16, 2009 at 03:24:33PM -0700, Andrew Morton wrote:
> On Tue, 14 Apr 2009 22:21:11 +0200
> Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> > Objective
> > ~~~~~~~~~
> > The objective of the io-throttle controller is to improve IO performance
> > predictability of different cgroups that share the same block devices.
>
> We should get an IO controller into Linux. Does anyone have a reason
> why it shouldn't be this one?
>
> > Respect to other priority/weight-based solutions the approach used by
> > this controller is to explicitly choke applications' requests
>
> Yes, blocking the offending application at a high level has always
> seemed to me to be the best way of implementing the controller.
>
> > that
> > directly or indirectly generate IO activity in the system (this
> > controller addresses both synchronous IO and writeback/buffered IO).
>
> The problem I've seen with some of the proposed controllers was that
> they didn't handle delayed writeback very well, if at all.
>
> Can you explain at a high level but in some detail how this works? If
> an application is doing a huge write(), how is that detected and how is
> the application made to throttle?
The writeback writes are handled in three steps:
1) track the owner of the dirty pages
2) detect writeback IO
3) delay writeback IO that exceeds the cgroup limits
For 1) I simply reused the bio-cgroup functionality. bio-cgroup uses
the page_cgroup structure to store the owner of each dirty page when the
page is dirtied. At that point the actual owner of the page can be
retrieved by looking at current->mm->owner (i.e. in __set_page_dirty()),
and its bio_cgroup id is stored into the page_cgroup structure.
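To make the tracking step more concrete, it looks more or less like the
following (only a sketch: bio_cgroup_from_task() and the id field name are
illustrative, not necessarily the exact names used by the bio-cgroup patch):

/* called when a page is dirtied, e.g. from __set_page_dirty() */
static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
{
        struct page_cgroup *pc;
        struct bio_cgroup *biog;

        if (!mm)
                return;
        pc = lookup_page_cgroup(page);  /* per-page cgroup metadata */
        if (!pc)
                return;

        rcu_read_lock();
        /* resolve the bio-cgroup of the task that owns this mm */
        biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
        pc->bio_cgroup_id = biog ? biog->id : 0;        /* illustrative field */
        rcu_read_unlock();
}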
Then for 2) we can detect writeback IO by placing a hook,
cgroup_io_throttle(), in submit_bio():
unsigned long long
cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);
If the IO operation is a write we look at the owner of the pages
involved (from the bio) and we check whether we must throttle the
operation. If the owner of those pages is "current", we throttle the
current task directly (via schedule_timeout_killable()) and we just
return 0 from cgroup_io_throttle() after the sleep.
3) If the owner of the page must be throttled and the current task is
not that task, e.g., it's a kernel thread (current->flags &
(PF_KTHREAD | PF_FLUSHER | PF_KSWAPD)), then we assume it's writeback
IO and we immediately return the number of jiffies that the real owner
should sleep.
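A rough sketch of that decision path (iothrottle_charge_bio() here is a
made-up name standing in for the real accounting code, the rest just mirrors
the description above):

unsigned long long
cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
{
        unsigned long sleep;

        /* charge "bytes" to the cgroup owning the pages in "bio" and
         * compute how long that cgroup must be delayed to stay in its limits */
        sleep = iothrottle_charge_bio(bio, bdev, bytes);
        if (!sleep)
                return 0;

        if (current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD))
                /* writeback on behalf of someone else: report the delay so
                 * that the caller can queue the bio with a deadline */
                return sleep;

        /* the offender is "current": throttle it right here */
        schedule_timeout_killable(sleep);
        return 0;
}

The hook point in submit_bio() then looks like this: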
void submit_bio(int rw, struct bio *bio)
{
        ...
        if (bio_has_data(bio)) {
                unsigned long sleep = 0;

                if (rw & WRITE) {
                        count_vm_events(PGPGOUT, count);
                        sleep = cgroup_io_throttle(bio,
                                        bio->bi_bdev, bio->bi_size);
                } else {
                        task_io_account_read(bio->bi_size);
                        count_vm_events(PGPGIN, count);
                        cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size);
                }
                ...
                if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
                        return;
        }
        generic_make_request(bio);
        ...
}
Since the current task must not be throttled here, we set a deadline of
jiffies + sleep and we add this request to an rbtree via
iothrottle_make_request().
This request will then be dispatched asynchronously by a kernel thread -
kiothrottled() - using generic_make_request() when the deadline expires.
There's a lot of room for optimization here, e.g. using multiple
threads per block device, workqueues, slow-work, ...
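Conceptually the dispatcher is something like the following (a simplified
sketch with illustrative names, not the actual kiothrottled code): throttled
bios sit in an rbtree ordered by deadline and are submitted once their
deadline has passed.

struct iot_request {
        struct rb_node node;
        unsigned long deadline;         /* jiffies when the bio may go out */
        struct bio *bio;
};

static struct rb_root iot_tree = RB_ROOT;
static DEFINE_SPINLOCK(iot_lock);

static int kiothrottled(void *unused)
{
        while (!kthread_should_stop()) {
                struct iot_request *req = NULL;
                struct rb_node *first;

                spin_lock_irq(&iot_lock);
                first = rb_first(&iot_tree);    /* earliest deadline */
                if (first) {
                        req = rb_entry(first, struct iot_request, node);
                        if (time_before(jiffies, req->deadline))
                                req = NULL;     /* nothing expired yet */
                        else
                                rb_erase(first, &iot_tree);
                }
                spin_unlock_irq(&iot_lock);

                if (req) {
                        /* dispatch the throttled bio on behalf of its owner */
                        generic_make_request(req->bio);
                        kfree(req);
                } else {
                        schedule_timeout_interruptible(1);
                }
        }
        return 0;
}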
In the old version (v12) I simply throttled writeback IO in
balance_dirty_pages_ratelimited_nr(), but this obviously led to bursty
writeback. In v13 the writeback IO is much smoother.
>
> Does it add new metadata to `struct page' for this?
struct page_cgroup
>
> I assume that the write throttling is also wired up into the MAP_SHARED
> write-fault path?
>
mmmh.. in case of writeback IO we account and throttle requests for
mm->owner. In case of synchronous IO (read/write) we always throttle the
current task in submit_bio().
>
>
> Does this patchset provide a path by which we can implement IO control
> for (say) NFS mounts?
Honestly I haven't looked at this at all. :) I'll check, but in principle
adding the cgroup_io_throttle() hook in the appropriate NFS path should be
enough to provide IO control for NFS mounts as well.
-Andrea
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/9] cgroup: io-throttle controller (v13)
[not found] ` <1239740480-28125-1-git-send-email-righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2009-04-16 22:24 ` Andrew Morton
@ 2009-04-30 13:20 ` Alan D. Brunelle
1 sibling, 0 replies; 10+ messages in thread
From: Alan D. Brunelle @ 2009-04-30 13:20 UTC (permalink / raw)
To: Andrea Righi
Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Paul Menage,
Carl Henrik Lunde, eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w,
Balbir Singh, fernando-gVGce1chcLdL9jVzuh4AOg,
dradford-cT2on/YLNlBWk0Htik3J/w, agk-9JcytcrH/bA+uJoB2kUjGw,
subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
axboe-tSWWG44O7X1aa/9Udqfwiw,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
ngupta-hpIqsD4AKlfQT0dZR+AlfA
Hi Andrea -
FYI: I ran a simple test using this code to try and gauge the overhead
incurred by enabling this technology. Using a single 400GB volume split
into two 200GB partitions I ran two processes in parallel performing a
mkfs (ext2) on each partition. First w/out cgroup io-throttle and then
with it enabled (with each task having throttling enabled to
400MB/second (much, much more than the device is actually capable of
doing)). The idea here is to see the base overhead of just having the
io-throttle code in the paths.
Doing 30 runs of each (w/out & w/ io-throttle enabled) shows very little
difference (time in seconds)
w/out: min=80.196 avg=80.585 max=81.030 sdev=0.215 spread=0.834
with: min=80.402 avg=80.836 max=81.623 sdev=0.327 spread=1.221
So only around 0.3% overhead - and that may not be conclusive with the
standard deviations seen.
--
FYI: The test was run on 2.6.30-rc1+your patches on a 16-way x86_64 box
(128GB RAM) plus a single FC volume off of a 1Gb FC RAID controller.
Regards,
Alan D. Brunelle
Hewlett-Packard
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/9] cgroup: io-throttle controller (v13)
[not found] ` <49F9A5BA.9030100-VXdhtT5mjnY@public.gmane.org>
@ 2009-05-01 11:11 ` Andrea Righi
0 siblings, 0 replies; 10+ messages in thread
From: Andrea Righi @ 2009-05-01 11:11 UTC (permalink / raw)
To: Alan D. Brunelle
Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Paul Menage,
Carl Henrik Lunde, eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w,
Balbir Singh, fernando-gVGce1chcLdL9jVzuh4AOg,
dradford-cT2on/YLNlBWk0Htik3J/w, agk-9JcytcrH/bA+uJoB2kUjGw,
subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
axboe-tSWWG44O7X1aa/9Udqfwiw,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
ngupta-hpIqsD4AKlfQT0dZR+AlfA
On Thu, Apr 30, 2009 at 09:20:58AM -0400, Alan D. Brunelle wrote:
> Hi Andrea -
Hi Alan,
>
> FYI: I ran a simple test using this code to try and gauge the overhead
> incurred by enabling this technology. Using a single 400GB volume split
> into two 200GB partitions I ran two processes in parallel performing a
> mkfs (ext2) on each partition. First w/out cgroup io-throttle and then
> with it enabled (with each task having throttling enabled to
> 400MB/second (much, much more than the device is actually capable of
> doing)). The idea here is to see the base overhead of just having the
> io-throttle code in the paths.
Interesting. I've never explicitly measured the actual overhead of the
io-throttle infrastructure; I'll add a similar test to the io-throttle
testcase.
>
> Doing 30 runs of each (w/out & w/ io-throttle enabled) shows very little
> difference (time in seconds)
>
> w/out: min=80.196 avg=80.585 max=81.030 sdev=0.215 spread=0.834
> with: min=80.402 avg=80.836 max=81.623 sdev=0.327 spread=1.221
>
> So only around 0.3% overhead - and that may not be conclusive with the
> standard deviations seen.
You should see less overhead with reads compared to a pure write
workload, because with reads we don't need to check whether the IO request
occurs in a different IO context. Things should also be improved with
v16-rc1
(http://download.systemimager.org/~arighi/linux/patches/io-throttle/cgroup-io-throttle-v16-rc1.patch).
So it would also be interesting to analyse the overhead of a read
stream compared to a write stream, as well as a comparison of random
reads/writes. I'll do that in my next benchmarking session.
>
> --
>
> FYI: The test was run on 2.6.30-rc1+your patches on a 16-way x86_64 box
> (128GB RAM) plus a single FC volume off of a 1Gb FC RAID controller.
>
> Regards,
> Alan D. Brunelle
> Hewlett-Packard
Thanks for posting these results,
-Andrea
^ permalink raw reply [flat|nested] 10+ messages in thread