linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] cgroup: limit block I/O bandwidth
@ 2008-01-18 22:39 Naveen Gupta
  2008-01-19 11:17 ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Naveen Gupta @ 2008-01-18 22:39 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Paul Menage, Dhaval Giani, Balbir Singh, LKML

>Paul Menage wrote:
>> On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
>>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
>>>> Allow limiting the block I/O bandwidth for specific process containers
>>>> (cgroups), imposing additional delays on I/O requests for those processes
>>>> that exceed the limits defined in the control group filesystem.
>>>>
>>>> Example:
>>>>   # mkdir /dev/cgroup
>>>>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>>> Just a minor nit, can't we name it io, keeping in mind that the other
>>> controllers are known as cpu and memory?
>>
>> Or maybe "blockio"?
>
>Agree, blockio seems better. Not all I/O is performed on block devices,
>and in this case we're considering block devices only.

Here we want to rate limit in the block layer; I would think the I/O
scheduler is the place where we are in a much better position to do
this kind of limiting.

Also, we are changing the behavior of the application by adding sleeps
to it during request submission. Moreover, we will prevent requests
from being merged, since we won't allow them to be submitted in this
case.

Since the bulk of the submission for writes is done by background
kernel threads, and we throttle based on limits on current, we will
end up throttling those threads and not the actual processes
submitting the I/O.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-18 22:39 [PATCH] cgroup: limit block I/O bandwidth Naveen Gupta
@ 2008-01-19 11:17 ` Andrea Righi
  2008-01-20 13:45   ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-19 11:17 UTC (permalink / raw)
  To: Naveen Gupta; +Cc: Paul Menage, Dhaval Giani, Balbir Singh, LKML

Naveen Gupta wrote:
>> Paul Menage wrote:
>>> On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
>>>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
>>>>> Allow limiting the block I/O bandwidth for specific process containers
>>>>> (cgroups), imposing additional delays on I/O requests for those processes
>>>>> that exceed the limits defined in the control group filesystem.
>>>>>
>>>>> Example:
>>>>>   # mkdir /dev/cgroup
>>>>>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>>>> Just a minor nit, can't we name it io, keeping in mind that the other
>>>> controllers are known as cpu and memory?
>>> Or maybe "blockio"?
>> Agree, blockio seems better. Not all I/O is performed on block devices,
>> and in this case we're considering block devices only.
> 
> Here we want to rate limit in the block layer; I would think the I/O
> scheduler is the place where we are in a much better position to do
> this kind of limiting.
> 
> Also, we are changing the behavior of the application by adding sleeps
> to it during request submission. Moreover, we will prevent requests
> from being merged, since we won't allow them to be submitted in this
> case.
> 
> Since the bulk of the submission for writes is done by background
> kernel threads, and we throttle based on limits on current, we will
> end up throttling those threads and not the actual processes
> submitting the I/O.

Yep, that's true! This works for read operations only... at the very
least, if I've understood correctly, we could throttle I/O reads in the
submit_bio() path and write operations in __set_page_dirty(). But this
would change the applications' behavior, so probably the best approach
would be to just get the I/O statistics from the TASK_IO_ACCOUNTING
stuff and implement the task delays at the I/O scheduler layer...
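
A minimal sketch of that idea, for discussion only: it assumes
CONFIG_TASK_IO_ACCOUNTING is enabled, and the helper name and the
KiB/s unit are made up here, not an existing API:

/*
 * Hypothetical helper: how many jiffies should @task still sleep so
 * that its accounted I/O stays below @limit KiB/s, measured since
 * @since (in jiffies)?  Uses the generic TASK_IO_ACCOUNTING counters
 * (task->ioac, as in 2.6.24) instead of a private byte count.
 */
static long iothrottle_delay(struct task_struct *task,
			     unsigned int limit, unsigned long since)
{
	u64 bytes = task->ioac.read_bytes + task->ioac.write_bytes;
	long elapsed = (long)jiffies - (long)since;

	if (!limit)
		return 0;
	/* bytes / (KiB/s) roughly equals milliseconds */
	do_div(bytes, limit);
	return (long)msecs_to_jiffies((unsigned int)bytes) - elapsed;
}

A positive return value would then be used at the scheduler layer to
delay the task's next request.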

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-19 11:17 ` Andrea Righi
@ 2008-01-20 13:45   ` Andrea Righi
  2008-01-20 14:32     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-20 13:45 UTC (permalink / raw)
  To: Naveen Gupta, Paul Menage, Dhaval Giani, Balbir Singh; +Cc: Jens Axboe, LKML

Andrea Righi wrote:
> Naveen Gupta wrote:
>>> Paul Menage wrote:
>>>> On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
>>>>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
>>>>>> Allow limiting the block I/O bandwidth for specific process containers
>>>>>> (cgroups), imposing additional delays on I/O requests for those processes
>>>>>> that exceed the limits defined in the control group filesystem.
>>>>>>
>>>>>> Example:
>>>>>>   # mkdir /dev/cgroup
>>>>>>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>>>>> Just a minor nit, can't we name it io, keeping in mind that the other
>>>>> controllers are known as cpu and memory?
>>>> Or maybe "blockio"?
>>> Agree, blockio seems better. Not all I/O is performed on block devices,
>>> and in this case we're considering block devices only.
>> Here we want to rate limit in the block layer; I would think the I/O
>> scheduler is the place where we are in a much better position to do
>> this kind of limiting.
>>
>> Also, we are changing the behavior of the application by adding sleeps
>> to it during request submission. Moreover, we will prevent requests
>> from being merged, since we won't allow them to be submitted in this
>> case.
>>
>> Since the bulk of the submission for writes is done by background
>> kernel threads, and we throttle based on limits on current, we will
>> end up throttling those threads and not the actual processes
>> submitting the I/O.
> 
> Yep, that's true! This works for read operations only... at the very
> least, if I've understood correctly, we could throttle I/O reads in the
> submit_bio() path and write operations in __set_page_dirty(). But this
> would change the applications' behavior, so probably the best approach
> would be to just get the I/O statistics from the TASK_IO_ACCOUNTING
> stuff and implement the task delays at the I/O scheduler layer...

OK, to better illustrate the concept I tried to put the I/O throttling
mechanism inside the simplest scheduler: noop. But this raises a new
problem: under certain conditions (typically with write requests) a
delay imposed on one process's I/O request can impact all the other
I/O requests of other processes, causing the whole system to hang
until the first request is completed (for example, a delayed request
can pin shared resources, such as the queue's request slots, that
every other process needs). So it seems that we have to deal with a
classic priority inversion problem.

Obviously the problem doesn't occur if the limited cgroup performs read
operations only.

I'm posting the modified patch below for discussion purposes only; you
can test it if you want, but you've been warned that the whole system
can hang for certain amounts of time.

---
Allow limiting the block I/O bandwidth for specific process containers
(cgroups), imposing additional delays in the I/O scheduler on the
requests made by those processes that exceed the limits defined in the
control group filesystem.

Example:
  # mkdir /dev/cgroup
  # mount -t cgroup -oblockio blockio /dev/cgroup
  # cd /dev/cgroup
  # mkdir foo
  --> the cgroup foo has been created
  # /bin/echo $$ > foo/tasks
  # /bin/echo 1024 > foo/blockio.bandwidth
  # sh
  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
      bandwidth of 1 MiB/s.
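
Reading the file back returns the debugging view produced by
iothrottle_read() below; the numbers here are purely illustrative:

  # cat /dev/cgroup/foo/blockio.bandwidth
  bandwidth-max: 1024 KiB/sec
      requested: 524288 bytes
   last request: 29441 jiffies
          delta: 250 jiffies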

NOTE: this works only with the noop scheduler and, in any case, it's
affected by the priority inversion problem described above.

Signed-off-by: Andrea Righi <a.righi@cineca.it>
---

diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-20 13:54:04.000000000 +0100
@@ -0,0 +1,226 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	spinlock_t lock;
+	unsigned long iorate;
+	unsigned long req;
+	unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ *   1. you are capable(CAP_SYS_ADMIN)
+ *   2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+			struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle *iot;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&iot->lock);
+	iot->last_request = jiffies;
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *buf,
+			       size_t nbytes, loff_t *ppos)
+{
+	ssize_t count, ret;
+	unsigned long delta, iorate, req, last_request;
+	struct iothrottle *iot;
+	char *page;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+	spin_lock_irq(&iot->lock);
+
+	delta = (long)jiffies - (long)iot->last_request;
+	iorate = iot->iorate;
+	req = iot->req;
+	last_request = iot->last_request;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/* print additional debugging stuff */
+	count = sprintf(page, "bandwidth-max: %lu KiB/sec\n"
+			      "    requested: %lu bytes\n"
+			      " last request: %lu jiffies\n"
+			      "        delta: %lu jiffies\n",
+			      iorate, req, last_request, delta);
+
+	ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+				 u64 val)
+{
+	struct iothrottle *iot;
+	int ret = 0;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+
+	spin_lock_irq(&iot->lock);
+	iot->iorate = (unsigned long)val;
+	iot->req = 0;
+	iot->last_request = jiffies;
+	spin_unlock_irq(&iot->lock);
+
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "bandwidth",
+		.read = iothrottle_read,
+		.write_uint = iothrottle_write_uint,
+	},
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+	.name = "blockio",
+	.create = iothrottle_create,
+	.destroy = iothrottle_destroy,
+	.populate = iothrottle_populate,
+	.subsys_id = iothrottle_subsys_id,
+};
+
+void cgroup_io_account(size_t bytes)
+{
+	struct iothrottle *iot;
+
+	if (!bytes)
+		return;
+
+	iot = task_to_iothrottle(current);
+	if (!iot)
+		return;
+
+	if (!iot->iorate)
+		return;
+
+	iot->req += bytes;
+}
+EXPORT_SYMBOL(cgroup_io_account);
+
+int io_throttle(struct task_struct *task)
+{
+	struct iothrottle *iot;
+	unsigned long delta, t;
+	long sleep;
+
+	iot = task_to_iothrottle(task);
+	if (!iot)
+		return 0;
+
+	if (!iot->iorate)
+		return 0;
+
+	delta = (long)jiffies - (long)iot->last_request;
+	t = msecs_to_jiffies(iot->req / iot->iorate);
+	if (!t)
+		return 0;
+
+	sleep = t - delta;
+	if (sleep > 0)
+		return sleep;
+
+	iot->req = 0;
+	iot->last_request = jiffies;
+
+	return 0;
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c	2008-01-20 13:54:14.000000000 +0100
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
 
 /*
  * for max sense size
@@ -3368,6 +3369,7 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_size);
+			cgroup_io_account(bio->bi_size);
 			count_vm_events(PGPGIN, count);
 		}
 
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile	2008-01-20 13:57:29.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= io-throttle.o
diff -urpN linux-2.6.24-rc8/block/noop-iosched.c linux-2.6.24-rc8-cgroup-io-throttling/block/noop-iosched.c
--- linux-2.6.24-rc8/block/noop-iosched.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/noop-iosched.c	2008-01-20 13:55:35.000000000 +0100
@@ -6,11 +6,27 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/io-throttle.h>
+
+#define RQ_TASK(rq)      ((struct task_struct *) (rq)->elevator_private)
 
 struct noop_data {
 	struct list_head queue;
+	struct request_queue *q;
+	struct work_struct work;
 };
 
+static void noop_work_handler(struct work_struct *work)
+{
+	struct noop_data *nd = container_of(work, struct noop_data, work);
+	struct request_queue *q = nd->q;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
 static void noop_merged_requests(struct request_queue *q, struct request *rq,
 				 struct request *next)
 {
@@ -20,10 +36,16 @@ static void noop_merged_requests(struct 
 static int noop_dispatch(struct request_queue *q, int force)
 {
 	struct noop_data *nd = q->elevator->elevator_data;
+	struct list_head *n, *tmp;
 
-	if (!list_empty(&nd->queue)) {
+	list_for_each_safe(n, tmp, &nd->queue) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(n, struct request, queuelist);
+		if (RQ_TASK(rq))
+			if (io_throttle(RQ_TASK(rq))) {
+				kblockd_schedule_work(&nd->work);
+				continue;
+			}
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -35,6 +57,7 @@ static void noop_add_request(struct requ
 {
 	struct noop_data *nd = q->elevator->elevator_data;
 
+	rq->elevator_private = current;
 	list_add_tail(&rq->queuelist, &nd->queue);
 }
 
@@ -72,6 +95,8 @@ static void *noop_init_queue(struct requ
 	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
 	if (!nd)
 		return NULL;
+	nd->q = q;
+	INIT_WORK(&nd->work, noop_work_handler);
 	INIT_LIST_HEAD(&nd->queue);
 	return nd;
 }
@@ -81,6 +106,7 @@ static void noop_exit_queue(elevator_t *
 	struct noop_data *nd = e->elevator_data;
 
 	BUG_ON(!list_empty(&nd->queue));
+	kblockd_flush_work(&nd->work);
 	kfree(nd);
 }
 
diff -urpN linux-2.6.24-rc8/fs/buffer.c linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c
--- linux-2.6.24-rc8/fs/buffer.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c	2008-01-20 13:55:44.000000000 +0100
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/io-throttle.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -713,6 +714,7 @@ static int __set_page_dirty(struct page 
 			__inc_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
+			cgroup_io_account(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
diff -urpN linux-2.6.24-rc8/fs/direct-io.c linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c
--- linux-2.6.24-rc8/fs/direct-io.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c	2008-01-20 13:55:55.000000000 +0100
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
+#include <linux/io-throttle.h>
 #include <asm/atomic.h>
 
 /*
@@ -667,6 +668,7 @@ submit_page_section(struct dio *dio, str
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		cgroup_io_account(len);
 	}
 
 	/*
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h	2008-01-20 13:59:53.000000000 +0100
@@ -0,0 +1,12 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern int io_throttle(struct task_struct *task);
+extern void cgroup_io_account(size_t bytes);
+#else
+static inline int io_throttle(struct task_struct *task) { return 0; }
+static inline void cgroup_io_account(size_t bytes) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig	2008-01-20 13:57:43.000000000 +0100
@@ -313,6 +313,16 @@ config CGROUP_NS
           for instance virtual servers and checkpoint/restart
           jobs.
 
+config CGROUP_IO_THROTTLE
+        bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+        depends on EXPERIMENTAL && CGROUPS
+	select TASK_IO_ACCOUNTING
+        help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroup(s).
+
+          Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP && CGROUPS
diff -urpN linux-2.6.24-rc8/mm/page-writeback.c linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c
--- linux-2.6.24-rc8/mm/page-writeback.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c	2008-01-20 13:56:13.000000000 +0100
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/io-throttle.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1014,6 +1015,7 @@ int __set_page_dirty_nobuffers(struct pa
 				__inc_bdi_stat(mapping->backing_dev_info,
 						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
+				cgroup_io_account(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
diff -urpN linux-2.6.24-rc8/mm/readahead.c linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c
--- linux-2.6.24-rc8/mm/readahead.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c	2008-01-20 13:56:02.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <linux/io-throttle.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -76,6 +77,7 @@ int read_cache_pages(struct address_spac
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		cgroup_io_account(PAGE_CACHE_SIZE);
 	}
 	return ret;
 }
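
A note on the units in io_throttle() above: iot->req is accounted in
bytes and iot->iorate in KiB/s (see the read handler), so req/iorate is
treated as a number of milliseconds (1 KiB/s is roughly 1.024 bytes/ms).
A small userspace simulation of the arithmetic, with illustrative
values and assuming HZ=250:

#include <stdio.h>

#define HZ 250	/* assumed tick rate, for illustration */

/* userspace stand-in for the kernel's rounding-up conversion */
static unsigned long msecs_to_jiffies(unsigned long ms)
{
	return (ms * HZ + 999) / 1000;
}

int main(void)
{
	unsigned long iorate = 1024;		/* KiB/s, i.e. ~1 MiB/s */
	unsigned long req = 2 * 1024 * 1024;	/* 2 MiB accounted so far */
	unsigned long delta = 100;		/* jiffies since last reset */
	unsigned long t = msecs_to_jiffies(req / iorate); /* ~2048 "ms" */
	long sleep = (long)(t - delta);

	printf("t = %lu jiffies, sleep = %ld jiffies (~%.2f s)\n",
	       t, sleep, (double)sleep / HZ);
	return 0;
}

With these numbers t = 512 jiffies, so the request would be held back
for another 412 jiffies, about 1.65 seconds.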


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 13:45   ` Andrea Righi
@ 2008-01-20 14:32     ` Jens Axboe
  2008-01-20 14:58       ` Balbir Singh
  2008-01-20 15:41       ` Andrea Righi
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2008-01-20 14:32 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Naveen Gupta, Paul Menage, Dhaval Giani, Balbir Singh, LKML

On Sun, Jan 20 2008, Andrea Righi wrote:
> Andrea Righi wrote:
> > Naveen Gupta wrote:
> >>> Paul Menage wrote:
> >>>> On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
> >>>>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
> >>>>>> Allow limiting the block I/O bandwidth for specific process containers
> >>>>>> (cgroups), imposing additional delays on I/O requests for those processes
> >>>>>> that exceed the limits defined in the control group filesystem.
> >>>>>>
> >>>>>> Example:
> >>>>>>   # mkdir /dev/cgroup
> >>>>>>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
> >>>>> Just a minor nit, can't we name it io, keeping in mind that the other
> >>>>> controllers are known as cpu and memory?
> >>>> Or maybe "blockio"?
> >>> Agree, blockio seems better. Not all I/O is performed on block devices,
> >>> and in this case we're considering block devices only.
> >> Here we want to rate limit in the block layer; I would think the I/O
> >> scheduler is the place where we are in a much better position to do
> >> this kind of limiting.
> >>
> >> Also, we are changing the behavior of the application by adding sleeps
> >> to it during request submission. Moreover, we will prevent requests
> >> from being merged, since we won't allow them to be submitted in this
> >> case.
> >>
> >> Since the bulk of the submission for writes is done by background
> >> kernel threads, and we throttle based on limits on current, we will
> >> end up throttling those threads and not the actual processes
> >> submitting the I/O.
> > 
> > Yep, that's true! This works for read operations only... at the very
> > least, if I've understood correctly, we could throttle I/O reads in the
> > submit_bio() path and write operations in __set_page_dirty(). But this
> > would change the applications' behavior, so probably the best approach
> > would be to just get the I/O statistics from the TASK_IO_ACCOUNTING
> > stuff and implement the task delays at the I/O scheduler layer...
> 
> OK, to better illustrate the concept I tried to put the I/O throttling
> mechanism inside the simplest scheduler: noop. But this raises a new
> problem: under certain conditions (typically with write requests) a
> delay imposed on one process's I/O request can impact all the other
> I/O requests of other processes, causing the whole system to hang
> until the first request is completed, so it seems that we have to deal
> with a classic priority inversion problem.
> 
> Obviously the problem doesn't occur if the limited cgroup performs read
> operations only.
> 
> I'm posting the modified patch below for discussion purposes only; you
> can test it if you want, but you've been warned that the whole system
> can hang for certain amounts of time.

Your approach is totally flawed, imho. For instance, you don't want a
process to be able to dirty memory at foo MB/sec but only actually
write it out at bar MB/sec.

The noop-iosched changes are also very buggy. The queue back pointer
breaks reference counting, and the task pointer storage assumes the
task will always be around. That's of course not the case.

IOW, you are doing this at the wrong level.

What problem are you trying to solve?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 14:32     ` Jens Axboe
@ 2008-01-20 14:58       ` Balbir Singh
  2008-01-20 15:41       ` Andrea Righi
  1 sibling, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2008-01-20 14:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrea Righi, Naveen Gupta, Paul Menage, Dhaval Giani, LKML,
	Pavel Emelyanov

* Jens Axboe <jens.axboe@oracle.com> [2008-01-20 15:32:40]:

> On Sun, Jan 20 2008, Andrea Righi wrote:
> Your approach is totally flawed, imho. For instance, you don't want a
> process to be able to dirty memory at foo MB/sec but only actually
> write it out at bar MB/sec.
> 
> The noop-iosched changes are also very buggy. The queue back pointer
> breaks reference counting, and the task pointer storage assumes the
> task will always be around. That's of course not the case.
> 
> IOW, you are doing this at the wrong level.

Andrea, some of the problems pointed out so far have been solved in
the IO controller from OpenVZ (cc'ing Pavel for inputs).

> 
> What problem are you trying to solve?
> 
> -- 
> Jens Axboe
> 

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 14:32     ` Jens Axboe
  2008-01-20 14:58       ` Balbir Singh
@ 2008-01-20 15:41       ` Andrea Righi
  2008-01-20 16:06         ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-20 15:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Naveen Gupta, Paul Menage, Dhaval Giani, Balbir Singh, LKML

Jens Axboe wrote:
> Your approach is totally flawed, imho. For instance, you don't want a
> process to be able to dirty memory at foo MB/sec but only actually
> write it out at bar MB/sec.

Right. Actually my problem here is that the processes that write out
blocks are different from the processes that write bytes in memory,
and I would like to be able to add limits on those processes that are
dirtying memory.

> The noop-iosched changes are also very buggy. The queue back pointer
> breaks reference counting, and the task pointer storage assumes the
> task will always be around. That's of course not the case.

Yes, this really needs a lot of fixes. I simply posted the patch to find
out whether such an approach (in general) could make sense or not.

> IOW, you are doing this at the wrong level.
> 
> What problem are you trying to solve?

Limiting the block I/O bandwidth for tasks that belong to a generic
cgroup, in order to provide a sort of QoS on block I/O.

Anyway, I'm quite new to the wonderful land of I/O scheduling, so any
help is appreciated.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 15:41       ` Andrea Righi
@ 2008-01-20 16:06         ` Jens Axboe
  2008-01-20 23:59           ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2008-01-20 16:06 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Naveen Gupta, Paul Menage, Dhaval Giani, Balbir Singh, LKML

On Sun, Jan 20 2008, Andrea Righi wrote:
> Jens Axboe wrote:
> > Your approach is totally flawed, imho. For instance, you don't want a
> > process to be able to dirty memory at foo MB/sec but only actually
> > write it out at bar MB/sec.
> 
> Right. Actually my problem here is that the processes that write out
> blocks are different from the processes that write bytes in memory,
> and I would like to be able to add limits on those processes that are
> dirtying memory.

That's another reason why you cannot do this on a per-process or group
basis, because you have no way of mapping back from the io queue path
which process originally dirtied this memory.

> > The noop-iosched changes are also very buggy. The queue back pointer
> > breaks reference counting, and the task pointer storage assumes the
> > task will always be around. That's of course not the case.
> 
> Yes, this really needs a lot of fixes. I simply posted the patch to find
> out whether such an approach (in general) could make sense or not.

It doesn't need fixes, it needs to be redesigned :-). No amount of
fixing will make the patch you posted correct, since the approach is
simply not feasible.

> > IOW, you are doing this at the wrong level.
> > 
> > What problem are you trying to solve?
> 
> Limiting the block I/O bandwidth for tasks that belong to a generic
> cgroup, in order to provide a sort of QoS on block I/O.
> 
> Anyway, I'm quite new to the wonderful land of I/O scheduling, so any
> help is appreciated.

For starters, you want to throttle when queuing IO, not dispatching it.
If you need to modify IO schedulers, then you are already at the wrong
level. That doesn't solve the write problem, but reads can be done.

If you want to solve it for both read/write(2), then move the code to
that level. That won't work for e.g. mmap, though...
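
A rough sketch of the level Jens describes, purely for illustration:
the placement is hypothetical, the function body is elided, and
io_throttle() is assumed to sleep the calling task when its cgroup is
over its limit:

/*
 * Hypothetical placement: throttle in the read(2)/write(2) path,
 * before any request is queued, so the task that sleeps is the one
 * actually issuing the I/O.  As noted above, this still would not
 * cover mmap'ed I/O.
 */
ssize_t vfs_write(struct file *file, const char __user *buf,
		  size_t count, loff_t *pos)
{
	io_throttle();	/* sleep here if the caller is over its limit */

	/* ... existing vfs_write() body, unchanged ... */
}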

And as Balbir notes, the openvz group has been looking at some of these
problems as well. As have lots of other people, btw; you probably want
to search around a bit and acquaint yourself with some of that work.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 16:06         ` Jens Axboe
@ 2008-01-20 23:59           ` Andrea Righi
  2008-01-22 19:02             ` Naveen Gupta
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-20 23:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Naveen Gupta, Paul Menage, Dhaval Giani, Balbir Singh, LKML,
	Pavel Emelyanov

Jens Axboe wrote:
> On Sun, Jan 20 2008, Andrea Righi wrote:
>> Jens Axboe wrote:
>>> Your approach is totally flawed, imho. For instance, you don't want a
>>> process to be able to dirty memory at foo MB/sec but only actually
>>> write it out at bar MB/sec.
>> Right. Actually my problem here is that the processes that write out
>> blocks are different from the processes that write bytes in memory,
>> and I would like to be able to add limits on those processes that are
>> dirtying memory.
> 
> That's another reason why you cannot do this on a per-process or group
> basis, because you have no way of mapping back from the io queue path
> which process originally dirtied this memory.
> 
>>> The noop-iosched changes are also very buggy. The queue back pointer
>>> breaks reference counting, and the task pointer storage assumes the
>>> task will always be around. That's of course not the case.
>> Yes, this really needs a lot of fixes. I simply posted the patch to find
>> out whether such an approach (in general) could make sense or not.
> 
> It doesn't need fixes, it needs to be redesigned :-). No amount of
> fixing will make the patch you posted correct, since the approach is
> simply not feasible.
> 
>>> IOW, you are doing this at the wrong level.
>>>
>>> What problem are you trying to solve?
>> Limiting the block I/O bandwidth for tasks that belong to a generic
>> cgroup, in order to provide a sort of QoS on block I/O.
>>
>> Anyway, I'm quite new to the wonderful land of I/O scheduling, so any
>> help is appreciated.
> 
> For starters, you want to throttle when queuing IO, not dispatching it.
> If you need to modify IO schedulers, then you are already at the wrong
> level. That doesn't solve the write problem, but reads can be done.
> 
> If you want to solve it for both read/write(2), then move the code to
> that level. That won't work for e.g. mmap, though...
> 
> And as Balbir notes, the openvz group has been looking at some of these
> problems as well. As have lots of other people, btw; you probably want
> to search around a bit and acquaint yourself with some of that work.
> 

OK, now I understand: the main problem is that pages can be written to
the block device when the information about the real process that
touched those pages in memory isn't available anymore. So the I/O
scheduler is not the right place to do this kind of limiting. Another
approach would be to just limit the I/O requests/sec for read
operations and the dirtied memory/sec for write operations (see
below). But this is ugly and not efficient at all, since it just
limits the writes in memory and not on the actual block devices.

AFAIK openvz supports per-VE I/O priorities (via CFQ), which is great,
but this isn't the same as bandwidth limiting. Anyway, I'll look more
closely at the openvz work to see how they addressed the problem.

Thanks,
-Andrea

Signed-off-by: Andrea Righi <a.righi@cineca.it>
---

diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-21 00:40:25.000000000 +0100
@@ -0,0 +1,221 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	spinlock_t lock;
+	unsigned long iorate;
+	unsigned long req;
+	unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ *   1. you are capable(CAP_SYS_ADMIN)
+ *   2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+			struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle *iot;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&iot->lock);
+	iot->last_request = jiffies;
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *buf,
+			       size_t nbytes, loff_t *ppos)
+{
+	ssize_t count, ret;
+	unsigned long delta, iorate, req, last_request;
+	struct iothrottle *iot;
+	char *page;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+	spin_lock_irq(&iot->lock);
+
+	delta = (long)jiffies - (long)iot->last_request;
+	iorate = iot->iorate;
+	req = iot->req;
+	last_request = iot->last_request;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/* print additional debugging stuff */
+	count = sprintf(page, "bandwidth-max: %lu KiB/sec\n"
+			      "    requested: %lu bytes\n"
+			      " last request: %lu jiffies\n"
+			      "        delta: %lu jiffies\n",
+			      iorate, req, last_request, delta);
+
+	ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+				 u64 val)
+{
+	struct iothrottle *iot;
+	int ret = 0;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+
+	spin_lock_irq(&iot->lock);
+	iot->iorate = (unsigned long)val;
+	iot->req = 0;
+	iot->last_request = jiffies;
+	spin_unlock_irq(&iot->lock);
+
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "bandwidth",
+		.read = iothrottle_read,
+		.write_uint = iothrottle_write_uint,
+	},
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+	.name = "blockio",
+	.create = iothrottle_create,
+	.destroy = iothrottle_destroy,
+	.populate = iothrottle_populate,
+	.subsys_id = iothrottle_subsys_id,
+};
+
+void cgroup_io_account(size_t bytes)
+{
+	struct iothrottle *iot;
+
+	iot = task_to_iothrottle(current);
+	if (!iot || !iot->iorate)
+		return;
+
+	iot->req += bytes;
+}
+EXPORT_SYMBOL(cgroup_io_account);
+
+void io_throttle(void)
+{
+	struct iothrottle *iot;
+	unsigned long delta, t;
+	long sleep;
+
+	iot = task_to_iothrottle(current);
+	if (!iot || !iot->iorate)
+		return;
+
+	delta = (long)jiffies - (long)iot->last_request;
+	if (!delta)
+		return;
+
+	t = msecs_to_jiffies(iot->req / iot->iorate);
+	if (!t)
+		return;
+
+	sleep = t - delta;
+	if (sleep > 0) {
+		pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+			 current, current->comm, sleep);
+		schedule_timeout_uninterruptible(sleep);
+	}
+
+	iot->req = 0;
+	iot->last_request = jiffies;
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c	2008-01-21 00:40:09.000000000 +0100
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
 
 /*
  * for max sense size
@@ -3368,6 +3369,8 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_size);
+			cgroup_io_account(bio->bi_size);
+			io_throttle();
 			count_vm_events(PGPGIN, count);
 		}
 
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile	2008-01-21 00:40:09.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= io-throttle.o
diff -urpN linux-2.6.24-rc8/fs/buffer.c linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c
--- linux-2.6.24-rc8/fs/buffer.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c	2008-01-21 00:40:09.000000000 +0100
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/io-throttle.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -713,12 +714,14 @@ static int __set_page_dirty(struct page 
 			__inc_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
+			cgroup_io_account(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
 	write_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	io_throttle();
 
 	return 1;
 }
diff -urpN linux-2.6.24-rc8/fs/direct-io.c linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c
--- linux-2.6.24-rc8/fs/direct-io.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c	2008-01-21 00:40:09.000000000 +0100
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
+#include <linux/io-throttle.h>
 #include <asm/atomic.h>
 
 /*
@@ -667,6 +668,8 @@ submit_page_section(struct dio *dio, str
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		cgroup_io_account(len);
+		io_throttle();
 	}
 
 	/*
diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h
--- linux-2.6.24-rc8/include/linux/cgroup_subsys.h	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h	2008-01-21 00:40:09.000000000 +0100
@@ -37,3 +37,9 @@ SUBSYS(cpuacct)
 
 /* */
 
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h	2008-01-21 00:40:09.000000000 +0100
@@ -0,0 +1,12 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void io_throttle(void);
+extern void cgroup_io_account(size_t bytes);
+#else
+static inline void io_throttle(void) { }
+static inline void cgroup_io_account(size_t bytes) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig	2008-01-21 00:40:09.000000000 +0100
@@ -313,6 +313,16 @@ config CGROUP_NS
           for instance virtual servers and checkpoint/restart
           jobs.
 
+config CGROUP_IO_THROTTLE
+        bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+        depends on EXPERIMENTAL && CGROUPS
+	select TASK_IO_ACCOUNTING
+        help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroup(s).
+
+          Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP && CGROUPS
diff -urpN linux-2.6.24-rc8/mm/page-writeback.c linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c
--- linux-2.6.24-rc8/mm/page-writeback.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c	2008-01-21 00:40:09.000000000 +0100
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/io-throttle.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1014,6 +1015,7 @@ int __set_page_dirty_nobuffers(struct pa
 				__inc_bdi_stat(mapping->backing_dev_info,
 						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
+				cgroup_io_account(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1023,6 +1025,7 @@ int __set_page_dirty_nobuffers(struct pa
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
+		io_throttle();
 		return 1;
 	}
 	return 0;
diff -urpN linux-2.6.24-rc8/mm/readahead.c linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c
--- linux-2.6.24-rc8/mm/readahead.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c	2008-01-21 00:40:09.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <linux/io-throttle.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -76,6 +77,8 @@ int read_cache_pages(struct address_spac
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		cgroup_io_account(PAGE_CACHE_SIZE);
+		io_throttle();
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-20 23:59           ` Andrea Righi
@ 2008-01-22 19:02             ` Naveen Gupta
  2008-01-22 23:11               ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Naveen Gupta @ 2008-01-22 19:02 UTC (permalink / raw)
  To: righiandr
  Cc: Jens Axboe, Paul Menage, Dhaval Giani, Balbir Singh, LKML,
	Pavel Emelyanov

On 20/01/2008, Andrea Righi <righiandr@users.sourceforge.net> wrote:
> Jens Axboe wrote:
> > On Sun, Jan 20 2008, Andrea Righi wrote:
> >> Jens Axboe wrote:
> >>> Your approach is totally flawed, imho. For instance, you don't want a
> >>> process to be able to dirty memory at foo MB/sec but only actually
> >>> write it out at bar MB/sec.
> >> Right. Actually my problem here is that the processes that write out
> >> blocks are different from the processes that write bytes in memory,
> >> and I would like to be able to add limits on those processes that are
> >> dirtying memory.
> >
> > That's another reason why you cannot do this on a per-process or group
> > basis, because you have no way of mapping back from the io queue path
> > which process originally dirtied this memory.
> >
> >>> The noop-iosched changes are also very buggy. The queue back pointer
> >>> breaks reference counting, and the task pointer storage assumes the
> >>> task will always be around. That's of course not the case.
> >> Yes, this really needs a lot of fixes. I simply posted the patch to find
> >> out whether such an approach (in general) could make sense or not.
> >
> > It doesn't need fixes, it needs to be redesigned :-). No amount of
> > fixing will make the patch you posted correct, since the approach is
> > simply not feasible.
> >
> >>> IOW, you are doing this at the wrong level.
> >>>
> >>> What problem are you trying to solve?
> >> Limiting the block I/O bandwidth for tasks that belong to a generic
> >> cgroup, in order to provide a sort of QoS on block I/O.
> >>
> >> Anyway, I'm quite new to the wonderful land of I/O scheduling, so any
> >> help is appreciated.
> >
> > For starters, you want to throttle when queuing IO, not dispatching it.
> > If you need to modify IO schedulers, then you are already at the wrong
> > level. That doesn't solve the write problem, but reads can be done.
> >
> > If you want to solve it for both read/write(2), then move the code to
> > that level. That won't work for e.g. mmap, though...
> >
> > And as Balbir notes, the openvz group has been looking at some of these
> > problems as well. As have lots of other people, btw; you probably want
> > to search around a bit and acquaint yourself with some of that work.
> >
>
> OK, now I understand: the main problem is that pages can be written to
> the block device when the information about the real process that
> touched those pages in memory isn't available anymore. So the I/O
> scheduler is not the right place to do this kind of limiting. Another
> approach would be to just limit the I/O requests/sec for read
> operations and the dirtied memory/sec for write operations (see
> below). But this is ugly and not efficient at all, since it just
> limits the writes in memory and not on the actual block devices.
>
> AFAIK openvz supports per-VE I/O priorities (via CFQ), which is great,
> but this isn't the same as bandwidth limiting. Anyway, I'll look more
> closely at the openvz work to see how they addressed the problem.

See if using priority levels to get a per-level bandwidth limit can
solve the priority inversion problem you were seeing earlier. I have a
priority scheduling patch for the anticipatory scheduler, if you want
to try it. It's much simpler than the CFQ priority code. I still need
to port it to 2.6.24, though, and send it across for review.

Though, as already said, this would be for the read side only.

-Naveen

>
> Thanks,
> -Andrea
>
> Signed-off-by: Andrea Righi <a.righi@cineca.it>
> ---
>
> diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
> --- linux-2.6.24-rc8/block/io-throttle.c        1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c   2008-01-21 00:40:25.000000000 +0100
> @@ -0,0 +1,221 @@
> +/*
> + * io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/gfp.h>
> +#include <linux/err.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/spinlock.h>
> +#include <linux/io-throttle.h>
> +
> +struct iothrottle {
> +       struct cgroup_subsys_state css;
> +       spinlock_t lock;
> +       unsigned long iorate;
> +       unsigned long req;
> +       unsigned long last_request;
> +};
> +
> +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
> +{
> +       return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
> +                           struct iothrottle, css);
> +}
> +
> +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> +{
> +       return container_of(task_subsys_state(task, iothrottle_subsys_id),
> +                           struct iothrottle, css);
> +}
> +
> +/*
> + * Rules: you can only create a cgroup if:
> + *   1. you are capable(CAP_SYS_ADMIN)
> + *   2. the target cgroup is a descendant of your own cgroup
> + *
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static struct cgroup_subsys_state *iothrottle_create(
> +                       struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +       struct iothrottle *iot;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return ERR_PTR(-EPERM);
> +
> +       if (!cgroup_is_descendant(cont))
> +               return ERR_PTR(-EPERM);
> +
> +       iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
> +       if (unlikely(!iot))
> +               return ERR_PTR(-ENOMEM);
> +
> +       spin_lock_init(&iot->lock);
> +       iot->last_request = jiffies;
> +
> +       return &iot->css;
> +}
> +
> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +       kfree(cgroup_to_iothrottle(cont));
> +}
> +
> +static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
> +                              struct file *file, char __user *buf,
> +                              size_t nbytes, loff_t *ppos)
> +{
> +       ssize_t count, ret;
> +       unsigned long delta, iorate, req, last_request;
> +       struct iothrottle *iot;
> +       char *page;
> +
> +       page = (char *)__get_free_page(GFP_TEMPORARY);
> +       if (!page)
> +               return -ENOMEM;
> +
> +       cgroup_lock();
> +       if (cgroup_is_removed(cont)) {
> +               cgroup_unlock();
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +
> +       iot = cgroup_to_iothrottle(cont);
> +       spin_lock_irq(&iot->lock);
> +
> +       delta = (long)jiffies - (long)iot->last_request;
> +       iorate = iot->iorate;
> +       req = iot->req;
> +       last_request = iot->last_request;
> +
> +       spin_unlock_irq(&iot->lock);
> +       cgroup_unlock();
> +
> +       /* print additional debugging stuff */
[snip]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-22 19:02             ` Naveen Gupta
@ 2008-01-22 23:11               ` Andrea Righi
  2008-01-23  1:17                 ` Naveen Gupta
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-22 23:11 UTC (permalink / raw)
  To: Naveen Gupta
  Cc: Jens Axboe, Paul Menage, Dhaval Giani, Balbir Singh, LKML,
	Pavel Emelyanov

Naveen Gupta wrote:
> See if using priority levels to have per-level bandwidth limits can
> solve the priority inversion problem you were seeing earlier. I have a
> priority scheduling patch for the anticipatory scheduler, if you want
> to try it. It's much simpler than CFQ priority. I still need to port
> it to 2.6.24 though and send it across for review.
>
> Though, as already said, this would be for the read side only.
> 
> -Naveen

Thanks Naveen, I can test your scheduler if you want, but the priority
inversion problem (or better, a "bandwidth limiting" that impacts the
wrong tasks) occurs only with write operations and, as Jens said, the
I/O scheduler is not the right place to implement this kind of limiting,
because at this level the processes have already performed the
operations (dirtying pages in memory) that raise the requests to the
I/O scheduler (which are made asynchronously by different processes).

A possible way to model the write limiting is to look at the dirty page
ratio, which is, in part, the principal source of the requests to the
I/O scheduler. But in this way we would also limit re-write operations
in memory, and that is too restrictive.

So, the cgroup dirty page throttling could be very interesting anyway,
but it's not the same thing as limiting the real write I/O bandwidth.
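
Just to make the dirty-ratio idea more concrete, a toy userspace model
could look like the following sketch (purely illustrative names, not
kernel code, and writeback is abstracted behind a callback):

#include <stdbool.h>

/* Illustrative per-cgroup dirty accounting: dirty_pages is decremented
 * elsewhere when writeback completes. */
struct dirty_cgroup {
	unsigned long dirty_pages;	/* pages dirtied, not yet written */
	unsigned long dirty_limit;	/* allowed dirty share, in pages */
};

static bool over_dirty_limit(const struct dirty_cgroup *cg)
{
	return cg->dirty_pages > cg->dirty_limit;
}

/* Hypothetical hook in the set-page-dirty path: charge the dirtier and
 * stall it until writeback brings its cgroup back under the limit.
 * Since every dirtying is charged, re-writes of in-memory pages get
 * throttled too, which is exactly the over-limiting described above. */
static void throttle_dirtier(struct dirty_cgroup *cg,
			     void (*wait_for_writeback)(void))
{
	cg->dirty_pages++;
	while (over_dirty_limit(cg))
		wait_for_writeback();
}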

For now I've rewritten my patch as follows, moving the code out of the
I/O scheduler. It seems to work in my small tests (apart from all the
things said above), but I'd like to find a different way to get a more
sophisticated I/O throttling approach (probably also looking directly
at the read()/write() level)... just investigating for now...

BTW I've seen that OpenVZ doesn't have a solution for this problem yet
either. AFAIU OpenVZ I/O activity is accounted in virtual environments
(VE) by the user beancounters (http://wiki.openvz.org/IO_accounting),
but there's no policy that implements block I/O limiting, except that
it's possible to set different per-VE I/O priorities (mapped onto CFQ
priorities). But I haven't understood whether this just sets that I/O
priority for all processes in the VE, or whether it does something
different. I still need to look at the code in detail.

-Andrea

Signed-off-by: Andrea Righi <a.righi@cineca.it>
---

diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-22 23:06:09.000000000 +0100
@@ -0,0 +1,222 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	spinlock_t lock;
+	unsigned long iorate;
+	unsigned long req;
+	unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ *   1. you are capable(CAP_SYS_ADMIN)
+ *   2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+			struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle *iot;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&iot->lock);
+	iot->last_request = jiffies;
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *buf,
+			       size_t nbytes, loff_t *ppos)
+{
+	ssize_t count, ret;
+	unsigned long delta, iorate, req, last_request;
+	struct iothrottle *iot;
+	char *page;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+	spin_lock_irq(&iot->lock);
+
+	delta = (long)jiffies - (long)iot->last_request;
+	iorate = iot->iorate;
+	req = iot->req;
+	last_request = iot->last_request;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/* print additional debugging stuff */
+	count = sprintf(page, "bandwidth-max: %lu KiB/sec\n"
+			      "    requested: %lu bytes\n"
+			      " last request: %lu jiffies\n"
+			      "        delta: %lu jiffies\n",
+			      iorate, req, last_request, delta);
+
+	ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+				 u64 val)
+{
+	struct iothrottle *iot;
+	int ret = 0;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+
+	spin_lock_irq(&iot->lock);
+	iot->iorate = (unsigned long)val;
+	iot->req = 0;
+	iot->last_request = jiffies;
+	spin_unlock_irq(&iot->lock);
+
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "bandwidth",
+		.read = iothrottle_read,
+		.write_uint = iothrottle_write_uint,
+	},
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+	.name = "blockio",
+	.create = iothrottle_create,
+	.destroy = iothrottle_destroy,
+	.populate = iothrottle_populate,
+	.subsys_id = iothrottle_subsys_id,
+};
+
+void cgroup_io_account(size_t bytes)
+{
+	struct iothrottle *iot;
+
+	iot = task_to_iothrottle(current);
+	if (!iot || !iot->iorate)
+		return;
+
+	iot->req += bytes;
+}
+EXPORT_SYMBOL(cgroup_io_account);
+
+void io_throttle(void)
+{
+	struct iothrottle *iot;
+	unsigned long delta, t;
+	long sleep;
+
+	iot = task_to_iothrottle(current);
+	if (!iot || !iot->iorate)
+		return;
+
+	delta = (long)jiffies - (long)iot->last_request;
+	if (!delta)
+		return;
+
+	t = msecs_to_jiffies(iot->req / iot->iorate);
+	if (!t)
+		return;
+
+	sleep = t - delta;
+	if (sleep > 0) {
+		pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+			 current, current->comm, sleep);
+		schedule_timeout_uninterruptible(sleep);
+		return;
+	}
+
+	iot->req = 0;
+	iot->last_request = jiffies;
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c	2008-01-22 23:04:34.000000000 +0100
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
 
 /*
  * for max sense size
@@ -3368,6 +3369,8 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_size);
+			cgroup_io_account(bio->bi_size);
+			io_throttle();
 			count_vm_events(PGPGIN, count);
 		}
 
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile	2008-01-22 23:04:34.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= io-throttle.o
diff -urpN linux-2.6.24-rc8/fs/buffer.c linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c
--- linux-2.6.24-rc8/fs/buffer.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c	2008-01-22 23:04:34.000000000 +0100
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/io-throttle.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -713,12 +714,14 @@ static int __set_page_dirty(struct page 
 			__inc_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
+			cgroup_io_account(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
 	write_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	io_throttle();
 
 	return 1;
 }
diff -urpN linux-2.6.24-rc8/fs/direct-io.c linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c
--- linux-2.6.24-rc8/fs/direct-io.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c	2008-01-22 23:04:34.000000000 +0100
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
+#include <linux/io-throttle.h>
 #include <asm/atomic.h>
 
 /*
@@ -667,6 +668,8 @@ submit_page_section(struct dio *dio, str
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		cgroup_io_account(len);
+		io_throttle();
 	}
 
 	/*
diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h
--- linux-2.6.24-rc8/include/linux/cgroup_subsys.h	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h	2008-01-22 23:04:34.000000000 +0100
@@ -37,3 +37,9 @@ SUBSYS(cpuacct)
 
 /* */
 
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h	2008-01-22 23:04:34.000000000 +0100
@@ -0,0 +1,12 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void io_throttle(void);
+extern void cgroup_io_account(size_t bytes);
+#else
+static inline void io_throttle(void) { }
+static inline void cgroup_io_account(size_t bytes) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig	2008-01-22 23:16:41.000000000 +0100
@@ -313,6 +313,18 @@ config CGROUP_NS
           for instance virtual servers and checkpoint/restart
           jobs.
 
+config CGROUP_IO_THROTTLE
+	bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && CGROUPS
+	help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroup(s). Currently this works correctly for read operations only.
+	  Write operations are modeled by looking at the dirty page ratio
+	  (write throttling in memory), since the writes to the real block
+	  device are processed asynchronously by different tasks.
+
+	  Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP && CGROUPS
diff -urpN linux-2.6.24-rc8/mm/page-writeback.c linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c
--- linux-2.6.24-rc8/mm/page-writeback.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c	2008-01-22 23:04:34.000000000 +0100
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/io-throttle.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1014,6 +1015,7 @@ int __set_page_dirty_nobuffers(struct pa
 				__inc_bdi_stat(mapping->backing_dev_info,
 						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
+				cgroup_io_account(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1023,6 +1025,7 @@ int __set_page_dirty_nobuffers(struct pa
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
+		io_throttle();
 		return 1;
 	}
 	return 0;
diff -urpN linux-2.6.24-rc8/mm/readahead.c linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c
--- linux-2.6.24-rc8/mm/readahead.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c	2008-01-22 23:04:34.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <linux/io-throttle.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -76,6 +77,8 @@ int read_cache_pages(struct address_spac
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		cgroup_io_account(PAGE_CACHE_SIZE);
+		io_throttle();
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-22 23:11               ` Andrea Righi
@ 2008-01-23  1:17                 ` Naveen Gupta
  2008-01-23 15:23                   ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Naveen Gupta @ 2008-01-23  1:17 UTC (permalink / raw)
  To: righiandr
  Cc: Jens Axboe, Paul Menage, Dhaval Giani, Balbir Singh, LKML,
	Pavel Emelyanov

On 22/01/2008, Andrea Righi <righiandr@users.sourceforge.net> wrote:
> Naveen Gupta wrote:
> > See if using priority levels to have per-level bandwidth limits can
> > solve the priority inversion problem you were seeing earlier. I have a
> > priority scheduling patch for the anticipatory scheduler, if you want
> > to try it. It's much simpler than CFQ priority. I still need to port
> > it to 2.6.24 though and send it across for review.
> >
> > Though, as already said, this would be for the read side only.
> >
> > -Naveen
>
> Thanks Naveen, I can test your scheduler if you want, but the priority
> inversion problem (or better, a "bandwidth limiting" that impacts the
> wrong tasks) occurs only with write operations and, as Jens said, the
> I/O scheduler is not the right place to implement this kind of limiting,
> because at this level the processes have already performed the
> operations (dirtying pages in memory) that raise the requests to the
> I/O scheduler (which are made asynchronously by different processes).

If the i/o submission is happening in bursts, and we limit the rate
during submission, we will have to stop the current task from
submitting any further i/o and hence change its pattern. Also, we
would then be limiting the submission rate and not the rate actually
going out on the wire, since the scheduler may reorder requests.

One way could be to limit the rate when the i/o is sent out from the
scheduler, and if we see that the number of allocated requests is above
a threshold, disallow request allocation in the offending task. This
way an application submitting bursts while staying under the allowed
average rate will not stop frequently. Something like a leaky bucket.
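
Roughly, such an allocation-time check could behave like this toy
model (all names here are made up, nothing from an actual kernel tree):

#include <stdbool.h>
#include <time.h>

struct leaky_bucket {
	unsigned long rate;     /* drain rate, bytes/sec */
	unsigned long capacity; /* max burst, bytes */
	unsigned long level;    /* bytes currently queued in the bucket */
	time_t last;            /* time of the last drain */
};

/* Drain the bucket for the elapsed time, then try to queue "bytes".
 * Returns true if the request may be allocated now, false if the
 * offending task should be stalled. */
static bool bucket_allow(struct leaky_bucket *b, unsigned long bytes)
{
	time_t now = time(NULL);
	unsigned long leaked = (unsigned long)(now - b->last) * b->rate;

	b->level = leaked > b->level ? 0 : b->level - leaked;
	b->last = now;

	if (b->level + bytes > b->capacity)
		return false;   /* over the threshold: throttle */
	b->level += bytes;
	return true;
}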

Now, for dirtying of memory that happens in a different context than
the submission path, you could still put a limit based on the dirty
ratio, with this limit higher than the actual b/w rate you are trying
to achieve. In the process you make sure you always have something to
write and still don't blow your entire memory. Or you can get really
fancy, track who dirtied the i/o, and start limiting it that way.



[snip]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-23  1:17                 ` Naveen Gupta
@ 2008-01-23 15:23                   ` Andrea Righi
  2008-01-23 15:38                     ` Balbir Singh
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-23 15:23 UTC (permalink / raw)
  To: Naveen Gupta
  Cc: Jens Axboe, Paul Menage, Dhaval Giani, Balbir Singh, LKML,
	Pavel Emelyanov

Naveen Gupta wrote:
> On 22/01/2008, Andrea Righi <righiandr@users.sourceforge.net> wrote:
>> Naveen Gupta wrote:
>>> See if using priority levels to have per-level bandwidth limits can
>>> solve the priority inversion problem you were seeing earlier. I have a
>>> priority scheduling patch for the anticipatory scheduler, if you want
>>> to try it. It's much simpler than CFQ priority. I still need to port
>>> it to 2.6.24 though and send it across for review.
>>>
>>> Though, as already said, this would be for the read side only.
>>>
>>> -Naveen
>> Thanks Naveen, I can test your scheduler if you want, but the priority
>> inversion problem (or better, a "bandwidth limiting" that impacts the
>> wrong tasks) occurs only with write operations and, as Jens said, the
>> I/O scheduler is not the right place to implement this kind of limiting,
>> because at this level the processes have already performed the
>> operations (dirtying pages in memory) that raise the requests to the
>> I/O scheduler (which are made asynchronously by different processes).
> 
> If the i/o submission is happening in bursts, and we limit the rate
> during submission, we will have to stop the current task from
> submitting any further i/o and hence change its pattern. Also, we
> would then be limiting the submission rate and not the rate actually
> going out on the wire, since the scheduler may reorder requests.

True. Doing i/o throttling at the scheduler level is probably more
correct, at least for read ops.

> One way could be to limit the rate when the i/o is sent out from the
> scheduler, and if we see that the number of allocated requests is above
> a threshold, disallow request allocation in the offending task. This
> way an application submitting bursts while staying under the allowed
> average rate will not stop frequently. Something like a leaky bucket.

Right, for read requests too.

> Now, for dirtying of memory that happens in a different context than
> the submission path, you could still put a limit based on the dirty
> ratio, with this limit higher than the actual b/w rate you are trying
> to achieve. In the process you make sure you always have something to
> write and still don't blow your entire memory. Or you can get really
> fancy, track who dirtied the i/o, and start limiting it that way.

Probably tracking who dirtied the pages would be the best approach, but
we also want to reduce the overhead of this tracking. So, we should find
a smart way to track which cgroup dirtied the pages, and then account
the i/o operations to the appropriate cgroup only when the i/o scheduler
dispatches the write requests for those pages. In this way throttling
could probably be done in __set_page_dirty() as well.
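
Conceptually, something like the following sketch (purely illustrative
types and names, not a working implementation):

#include <stddef.h>

/* Illustrative only: a per-cgroup accounting object, and a page that
 * remembers which cgroup dirtied it. */
struct io_cgroup {
	unsigned long dirty_bytes;	/* charged when the page is dirtied */
	unsigned long written_bytes;	/* charged when the write is dispatched */
};

struct page_stub {
	struct io_cgroup *dirtier;	/* set in the set-page-dirty path */
};

/* Dirty-time hook: remember the owner (this is also where throttling
 * could be applied). */
static void account_page_dirtied(struct page_stub *page,
				 struct io_cgroup *cg, size_t bytes)
{
	page->dirtier = cg;
	cg->dirty_bytes += bytes;
}

/* Dispatch-time hook: charge the cgroup that actually dirtied the page,
 * not the kernel thread that performs the writeback. */
static void account_page_written(struct page_stub *page, size_t bytes)
{
	if (page->dirtier)
		page->dirtier->written_bytes += bytes;
}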

-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-23 15:23                   ` Andrea Righi
@ 2008-01-23 15:38                     ` Balbir Singh
  2008-01-23 20:55                       ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2008-01-23 15:38 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Naveen Gupta, Jens Axboe, Paul Menage, Dhaval Giani, LKML,
	Pavel Emelyanov

* Andrea Righi <righiandr@users.sourceforge.net> [2008-01-23 16:23:59]:

> Probably tracking who dirtied the pages would be the best approach, but
> we also want to reduce the overhead of this tracking. So, we should find
> a smart way to track which cgroup dirtied the pages, and then account
> the i/o operations to the appropriate cgroup only when the i/o scheduler
> dispatches the write requests for those pages. In this way throttling
> could probably be done in __set_page_dirty() as well.
>

I think the OpenVZ controller works that way.
 
> -Andrea

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-23 15:38                     ` Balbir Singh
@ 2008-01-23 20:55                       ` Andrea Righi
  2008-01-24  9:05                         ` Pavel Emelyanov
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-23 20:55 UTC (permalink / raw)
  To: Andrea Righi, Naveen Gupta, Jens Axboe, Paul Menage,
	Dhaval Giani, LKML, Pavel Emelyanov

Balbir Singh wrote:
> * Andrea Righi <righiandr@users.sourceforge.net> [2008-01-23 16:23:59]:
> 
>> Probably tracking who dirtied the pages would be the best approach, but
>> we also want to reduce the overhead of this tracking. So, we should find
>> a smart way to track which cgroup dirtied the pages, and then account
>> the i/o operations to the appropriate cgroup only when the i/o scheduler
>> dispatches the write requests for those pages. In this way throttling
>> could probably be done in __set_page_dirty() as well.
>>
> 
> I think the OpenVZ controller works that way.

Well... looking at the code it seems that OpenVZ doesn't use this
strategy; instead it performs UBC-based I/O accounting, looking at
__set_page_dirty*() for writes and submit_bio() for reads. Then,
independently of the accounting data, it uses a per-UBC i/o priority
model that is mapped directly onto the CFQ i/o priority model.

-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-23 20:55                       ` Andrea Righi
@ 2008-01-24  9:05                         ` Pavel Emelyanov
  2008-01-24 13:48                           ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Emelyanov @ 2008-01-24  9:05 UTC (permalink / raw)
  To: righiandr; +Cc: Naveen Gupta, Jens Axboe, Paul Menage, Dhaval Giani, LKML

Andrea Righi wrote:
> Balbir Singh wrote:
>> * Andrea Righi <righiandr@users.sourceforge.net> [2008-01-23 16:23:59]:
>>
>>> Probably tracking who dirtied the pages would be the best approach, but
>>> we also want to reduce the overhead of this tracking. So, we should find
>>> a smart way to track which cgroup dirtied the pages, and then account
>>> the i/o operations to the appropriate cgroup only when the i/o scheduler
>>> dispatches the write requests for those pages. In this way throttling
>>> could probably be done in __set_page_dirty() as well.
>>>
>> I think the OpenVZ controller works that way.
> 
> Well... looking at the code it seems that OpenVZ doesn't use this
> strategy; instead it performs UBC-based I/O accounting, looking at

We do track the task (well - the beancounter) who made the page
dirty and then use this context for async write scheduling.

> __set_page_dirty*() for writes and submit_bio() for reads. Then,
> independently of the accounting data, it uses a per-UBC i/o priority
> model that is mapped directly onto the CFQ i/o priority model.

Vasisly Tarasov (our I/O guru ;)) has already prepared an RFC patchset
for Jens with a group scheduler (for sync requests only) and is going
to send it this week or next.

> -Andrea
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-24  9:05                         ` Pavel Emelyanov
@ 2008-01-24 13:48                           ` Andrea Righi
  2008-01-24 13:50                             ` Balbir Singh
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Righi @ 2008-01-24 13:48 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Naveen Gupta, Jens Axboe, Paul Menage, Dhaval Giani, LKML, Balbir Singh

Pavel Emelyanov wrote:
> Andrea Righi wrote:
>> Balbir Singh wrote:
>>> * Andrea Righi <righiandr@users.sourceforge.net> [2008-01-23 16:23:59]:
>>>
>>>> Probably tracking who dirtied the pages would be the best approach, but
>>>> we also want to reduce the overhead of this tracking. So, we should find
>>>> a smart way to track which cgroup dirtied the pages, and then account
>>>> the i/o operations to the appropriate cgroup only when the i/o scheduler
>>>> dispatches the write requests for those pages. In this way throttling
>>>> could probably be done in __set_page_dirty() as well.
>>>>
>>> I think the OpenVZ controller works that way.
>> Well... looking at the code it seems that OpenVZ doesn't use this
>> strategy; instead it performs UBC-based I/O accounting, looking at
> 
> We do track the task (well - the beancounter) who made the page
> dirty and then use this context for async write scheduling.

Interesting... now I see that task_io_account_write() takes a
"struct page *" as an argument and that the "struct page" has the
beancounter pointer.

> 
>> __set_page_dirty*() for writes and submit_bio() for reads. Then,
>> independently of the accounting data, it uses a per-UBC i/o priority
>> model that is mapped directly onto the CFQ i/o priority model.
> 
> Vasisly Tarasov (our I/O guru ;)) has already prepared an RFC patchset
> for Jens with a group scheduler (for sync requests only) and is going
> to send it this week or next.
> 

Very good! I look forward to this patchset.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-24 13:48                           ` Andrea Righi
@ 2008-01-24 13:50                             ` Balbir Singh
  0 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2008-01-24 13:50 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Pavel Emelyanov, Naveen Gupta, Jens Axboe, Paul Menage,
	Dhaval Giani, LKML

* Andrea Righi <righiandr@users.sourceforge.net> [2008-01-24 14:48:01]:

> Pavel Emelyanov wrote:
> > Andrea Righi wrote:
> >> Balbir Singh wrote:
> >>> * Andrea Righi <righiandr@users.sourceforge.net> [2008-01-23 16:23:59]:
> >>>
> >>>> Probably tracking who dirtied the pages would be the best approach, but
> >>>> we also want to reduce the overhead of this tracking. So, we should find
> >>>> a smart way to track which cgroup dirtied the pages, and then account
> >>>> the i/o operations to the appropriate cgroup only when the i/o scheduler
> >>>> dispatches the write requests for those pages. In this way throttling
> >>>> could probably be done in __set_page_dirty() as well.
> >>>>
> >>> I think the OpenVZ controller works that way.
> >> Well... looking at the code it seems that OpenVZ doesn't use this
> >> strategy; instead it performs UBC-based I/O accounting, looking at
> > 
> > We do track the task (well - the beancounter) who made the page
> > dirty and then use this context for async write scheduling.
> 
> Interesting... now I see that task_io_account_write() takes a
> "struct page *" as an argument and that the "struct page" has the
> beancounter pointer.
> 
> > 
> >> __set_page_dirty*() for writes and submit_bio() for reads. Then,
> >> independently of the accounting data, it uses a per-UBC i/o priority
> >> model that is mapped directly onto the CFQ i/o priority model.
> > 
> > Vasisly Tarasov (our I/O guru ;)) has already prepared an RFC patchset
> > for Jens with a group scheduler (for sync requests only) and is going
> > to send it this week or next.
> > 
> 
> Very good! I look forward to this patchset.
>

Excellent! Can't wait to test it.
 
> Thanks,
> -Andrea

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-18 11:41 Andrea Righi
  2008-01-18 12:36 ` Dhaval Giani
@ 2008-01-18 15:50 ` Andrea Righi
  1 sibling, 0 replies; 22+ messages in thread
From: Andrea Righi @ 2008-01-18 15:50 UTC (permalink / raw)
  To: Dhaval Giani, Balbir Singh, Paul Menage; +Cc: LKML

Andrea Righi wrote:

[snip]

> +static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
> +			       struct file *file, char __user *buf,
> +			       size_t nbytes, loff_t *ppos)
> +{
> +	ssize_t count, ret;
> +	unsigned long delta, iorate, req, last_request;
> +	struct iothrottle *iot;
> +	char *page;
> +
> +	page = (char *)__get_free_page(GFP_TEMPORARY);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	cgroup_lock();
> +	if (cgroup_is_removed(cont)) {
> +		cgroup_unlock();
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	iot = cgroup_to_iothrottle(cont);
> +	spin_lock_irq(&iot->lock);
> +
> +	delta = (long)jiffies - (long)iot->last_request;
> +	iorate = iot->iorate;
> +	req = iot->req << 1;
> +	last_request = iot->last_request;
> +
> +	spin_unlock_irq(&iot->lock);
> +	cgroup_unlock();
> +
> +	/* print additional debugging stuff */
> +	count = sprintf(page, "     io-rate: %lu KiB/sec\n"
> +			      "   requested: %lu KiB\n"
> +			      "last_request: %lu jiffies\n"
> +			      "       delta: %lu jiffies\n",
> +			iorate, req << 1, last_request, delta);
                                ^^^^^^^^
Argh! I just found a (minor) bug here... :-( the variable req is
already translated from sectors to KiB here, so there's no need to
lshift it again (or rather, there's no need to shift it before).

Sorry for that. Fixed patch is below.
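
(For the record, the unit conversion in io_throttle() below works like
this, assuming the usual 512-byte sectors: iorate is in KiB/s and req
is in sectors, and since 1 KiB = 2 sectors, n = req / iorate comes out
in half-seconds; hence the n * 1000 / 2 when converting to
milliseconds. For example, with iorate = 1024 KiB/s and req = 4096
sectors (i.e. 2048 KiB, two seconds worth of I/O): n = 4096 / 1024 = 4
and msecs_to_jiffies(4 * 1000 / 2) gives the expected 2000 ms.)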

Signed-off-by: Andrea Righi <a.righi@cineca.it>
---

diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-18 16:14:40.000000000 +0100
@@ -0,0 +1,250 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	spinlock_t lock;
+	unsigned long iorate;
+	unsigned long req;
+	unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ *   1. you are capable(CAP_SYS_ADMIN)
+ *   2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+			struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle *iot;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&iot->lock);
+	iot->last_request = jiffies;
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *buf,
+			       size_t nbytes, loff_t *ppos)
+{
+	ssize_t count, ret;
+	unsigned long delta, iorate, req, last_request;
+	struct iothrottle *iot;
+	char *page;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+	spin_lock_irq(&iot->lock);
+
+	delta = (long)jiffies - (long)iot->last_request;
+	iorate = iot->iorate;
+	req = iot->req;
+	last_request = iot->last_request;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/* print additional debugging stuff */
+	count = sprintf(page, "     io-rate: %lu KiB/sec\n"
+			      "   requested: %lu KiB\n"
+			      "last_request: %lu jiffies\n"
+			      "       delta: %lu jiffies\n",
+			iorate, req << 1, last_request, delta);
+
+	ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+				 u64 val)
+{
+	struct iothrottle *iot;
+	int ret = 0;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+
+	spin_lock_irq(&iot->lock);
+	iot->iorate = (unsigned long)val;
+	spin_unlock_irq(&iot->lock);
+
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "io-rate",
+		.read = iothrottle_read,
+		.write_uint = iothrottle_write_uint,
+	},
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+	.name = "io-throttle",
+	.create = iothrottle_create,
+	.destroy = iothrottle_destroy,
+	.populate = iothrottle_populate,
+	.subsys_id = iothrottle_subsys_id,
+};
+
+void io_throttle(int nr_sectors)
+{
+	struct iothrottle *iot;
+	unsigned long delta, n;
+	long sleep;
+
+	cgroup_lock();
+	iot = task_to_iothrottle(current);
+	if (!iot)
+		goto out;
+
+	spin_lock_irq(&iot->lock);
+	if (!iot->iorate)
+		goto out2;
+
+	/*
+	 * The idea is the following: evaluate the actual I/O rate of a
+	 * process by looking at the sectors requested over the time elapsed
+	 * since the last request. If the actual I/O rate exceeds the maximum
+	 * allowed rate, put the current task to sleep for the amount of
+	 * time needed to bring the actual I/O rate back under the allowed
+	 * limit.
+	 *
+	 * The time to sleep is evaluated as:
+	 *
+	 *   sleep = (sectors_requested / allowed_iorate) - time_elapsed
+	 */
+	delta = (long)jiffies - (long)iot->last_request;
+	iot->req += nr_sectors;
+	n = iot->req / iot->iorate;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/*
+	 * If delta cannot be evaluated (the interval between two requests
+	 * is too small) or n cannot (the request itself is too small), keep
+	 * the requested sectors accounted in iot->req and sum them to the
+	 * sectors of the next request.
+	 */
+	if (!delta || !n)
+		return;
+
+	/*
+	 * Convert n to jiffies (remember that iot->iorate is in KB/s and
+	 * must be converted to sectors/jiffy)
+	 */
+	sleep = msecs_to_jiffies(n * 1000 / 2) - delta;
+	if (sleep > 0) {
+		pr_debug("io-throttle: task %p (%s) must sleep %ld jiffies\n",
+			 current, current->comm, sleep);
+		schedule_timeout_uninterruptible(sleep);
+	}
+
+	/*
+	 * Note: the iothrottle element may have changed during the sleep,
+	 * so we must refresh it before resetting the statistics.
+	 */
+	cgroup_lock();
+	iot = task_to_iothrottle(current);
+	if (!iot)
+		goto out;
+
+	spin_lock_irq(&iot->lock);
+	iot->req = 0;
+	iot->last_request = jiffies;
+out2:
+	spin_unlock_irq(&iot->lock);
+out:
+	cgroup_unlock();
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c	2008-01-18 16:14:09.000000000 +0100
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
 
 /*
  * for max sense size
@@ -3221,6 +3222,8 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	io_throttle(nr_sectors);
+
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile	2008-01-18 16:14:09.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= io-throttle.o
diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h
--- linux-2.6.24-rc8/include/linux/cgroup_subsys.h	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h	2008-01-18 16:14:09.000000000 +0100
@@ -37,3 +37,9 @@ SUBSYS(cpuacct)
 
 /* */
 
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h	2008-01-18 16:14:09.000000000 +0100
@@ -0,0 +1,10 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void io_throttle(int nr_sectors);
+#else
+static inline void io_throttle(int nr_sectors) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig	2008-01-18 16:14:09.000000000 +0100
@@ -313,6 +313,15 @@ config CGROUP_NS
           for instance virtual servers and checkpoint/restart
           jobs.
 
+config CGROUP_IO_THROTTLE
+        bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+        depends on EXPERIMENTAL && CGROUPS
+        help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroups.
+
+          Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP && CGROUPS

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-18 12:41   ` Paul Menage
@ 2008-01-18 13:02     ` Andrea Righi
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Righi @ 2008-01-18 13:02 UTC (permalink / raw)
  To: Paul Menage; +Cc: Dhaval Giani, Balbir Singh, LKML

Paul Menage wrote:
> On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
>>> Allow to limit the block I/O bandwidth for specific process containers
>>> (cgroups) imposing additional delays on I/O requests for those processes
>>> that exceed the limits defined in the control group filesystem.
>>>
>>> Example:
>>>   # mkdir /dev/cgroup
>>>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>> Just a minor nit, can't we name it as io, keeping in mind that other
>> controllers are known as cpu and memory?
> 
> Or maybe "blockio"?

Agree, blockio seems better. Not all I/O is performed on block devices
and in this case we're considering block devices only.

-Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-18 12:36 ` Dhaval Giani
@ 2008-01-18 12:41   ` Paul Menage
  2008-01-18 13:02     ` Andrea Righi
  0 siblings, 1 reply; 22+ messages in thread
From: Paul Menage @ 2008-01-18 12:41 UTC (permalink / raw)
  To: Dhaval Giani; +Cc: Andrea Righi, Balbir Singh, LKML

On Jan 18, 2008 7:36 AM, Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
> > Allow to limit the block I/O bandwidth for specific process containers
> > (cgroups) imposing additional delays on I/O requests for those processes
> > that exceed the limits defined in the control group filesystem.
> >
> > Example:
> >   # mkdir /dev/cgroup
> >   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>
> Just a minor nit, can't we name it as io, keeping in mind that other
> controllers are known as cpu and memory?

Or maybe "blockio"?

Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup: limit block I/O bandwidth
  2008-01-18 11:41 Andrea Righi
@ 2008-01-18 12:36 ` Dhaval Giani
  2008-01-18 12:41   ` Paul Menage
  2008-01-18 15:50 ` Andrea Righi
  1 sibling, 1 reply; 22+ messages in thread
From: Dhaval Giani @ 2008-01-18 12:36 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Balbir Singh, Paul Menage, LKML

On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
> Allow to limit the block I/O bandwidth for specific process containers
> (cgroups) imposing additional delays on I/O requests for those processes
> that exceed the limits defined in the control group filesystem.
> 
> Example:
>   # mkdir /dev/cgroup
>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup

Just a minor nit, can't we name it as io, keeping in mind that other
controllers are known as cpu and memory?

Will try it out and give some more feedback.

Thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] cgroup: limit block I/O bandwidth
@ 2008-01-18 11:41 Andrea Righi
  2008-01-18 12:36 ` Dhaval Giani
  2008-01-18 15:50 ` Andrea Righi
  0 siblings, 2 replies; 22+ messages in thread
From: Andrea Righi @ 2008-01-18 11:41 UTC (permalink / raw)
  To: Balbir Singh, Paul Menage; +Cc: LKML

Allow limiting the block I/O bandwidth for specific process containers
(cgroups) by imposing additional delays on I/O requests for those
processes that exceed the limits defined in the control group filesystem.

Example:
  # mkdir /dev/cgroup
  # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
  # cd /dev/cgroup
  # mkdir foo
  --> the cgroup foo has been created
  # /bin/echo $$ > foo/tasks
  # /bin/echo 1024 > foo/io-throttle.io-rate
  # sh
  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
      bandwidth of 1MB/s (io-throttle.io-rate is expressed in KB/s).
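
To see the arithmetic at work, here is a rough userspace sketch of the
computation io_throttle() performs (illustrative values only, with HZ
assumed to be 1000; this is not kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned long iorate = 1024; /* KB/s, as set in io-throttle.io-rate */
          unsigned long req = 4096;    /* sectors submitted so far (2048 KB) */
          unsigned long delta = 1;     /* jiffies elapsed since the last request */
          unsigned long n = req / iorate;
          /* mirrors msecs_to_jiffies(n * 1000 / 2) at HZ=1000 */
          long sleep = (long)(n * 1000 / 2) - (long)delta;

          if (sleep > 0)
                  printf("task must sleep %ld jiffies (about 2 seconds)\n", sleep);
          return 0;
  }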

Future improvements:
* allow limiting I/O operations per second as well, instead of KB/s only
  (see the sketch below)
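
A purely hypothetical sketch of what such an extension could add to
struct iothrottle (the field names below are invented for illustration
and are not part of this patch):

  struct iothrottle_iops {
          unsigned long iops;   /* max I/O operations per second, 0 = unlimited */
          unsigned long nr_req; /* operations submitted since the last reset */
  };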

Signed-off-by: Andrea Righi <a.righi@cineca.it>
---

diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c	2008-01-17 23:16:58.000000000 +0100
@@ -0,0 +1,250 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@cineca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+	struct cgroup_subsys_state css;
+	spinlock_t lock;
+	unsigned long iorate;
+	unsigned long req;
+	unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, iothrottle_subsys_id),
+			    struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ *   1. you are capable(CAP_SYS_ADMIN)
+ *   2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+			struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle *iot;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&iot->lock);
+	iot->last_request = jiffies;
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *buf,
+			       size_t nbytes, loff_t *ppos)
+{
+	ssize_t count, ret;
+	unsigned long delta, iorate, req, last_request;
+	struct iothrottle *iot;
+	char *page;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+	spin_lock_irq(&iot->lock);
+
+	delta = (long)jiffies - (long)iot->last_request;
+	iorate = iot->iorate;
+	req = iot->req << 1;
+	last_request = iot->last_request;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/* print additional debugging stuff */
+	count = sprintf(page, "     io-rate: %lu KiB/sec\n"
+			      "   requested: %lu KiB\n"
+			      "last_request: %lu jiffies\n"
+			      "       delta: %lu jiffies\n",
+			iorate, req << 1, last_request, delta);
+
+	ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+				 u64 val)
+{
+	struct iothrottle *iot;
+	int ret = 0;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	iot = cgroup_to_iothrottle(cont);
+
+	spin_lock_irq(&iot->lock);
+	iot->iorate = (unsigned long)val;
+	spin_unlock_irq(&iot->lock);
+
+out:
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "io-rate",
+		.read = iothrottle_read,
+		.write_uint = iothrottle_write_uint,
+	},
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+	.name = "io-throttle",
+	.create = iothrottle_create,
+	.destroy = iothrottle_destroy,
+	.populate = iothrottle_populate,
+	.subsys_id = iothrottle_subsys_id,
+};
+
+void io_throttle(int nr_sectors)
+{
+	struct iothrottle *iot;
+	unsigned long delta, n;
+	long sleep;
+
+	cgroup_lock();
+	iot = task_to_iothrottle(current);
+	if (!iot)
+		goto out;
+
+	spin_lock_irq(&iot->lock);
+	if (!iot->iorate)
+		goto out2;
+
+	/*
+	 * The idea is the following: evaluate the actual I/O rate of a
+	 * process by looking at the sectors requested over the time elapsed
+	 * since the last request. If the actual I/O rate exceeds the maximum
+	 * allowed rate, put the current task to sleep for the amount of
+	 * time needed to bring the actual I/O rate back under the allowed
+	 * limit.
+	 *
+	 * The time to sleep is evaluated as:
+	 *
+	 *   sleep = (sectors_requested / allowed_iorate) - time_elapsed
+	 */
+	delta = (long)jiffies - (long)iot->last_request;
+	iot->req += nr_sectors;
+	n = iot->req / iot->iorate;
+
+	spin_unlock_irq(&iot->lock);
+	cgroup_unlock();
+
+	/*
+	 * If delta cannot be evaluated (the interval between two requests
+	 * is too small) or n cannot (the request itself is too small), keep
+	 * the requested sectors accounted in iot->req and sum them to the
+	 * sectors of the next request.
+	 */
+	if (!delta || !n)
+		return;
+
+	/*
+	 * Convert n to jiffies (remember that iot->iorate is in KB/s and
+	 * must be converted to sectors/jiffy)
+	 */
+	sleep = msecs_to_jiffies(n * 1000 / 2) - delta;
+	if (sleep > 0) {
+		pr_debug("io-throttle: task %p (%s) must sleep %ld jiffies\n",
+			 current, current->comm, sleep);
+		schedule_timeout_uninterruptible(sleep);
+	}
+
+	/*
+	 * Note: the iothrottle element may have changed during the sleep,
+	 * so we must refresh it before resetting the statistics.
+	 */
+	cgroup_lock();
+	iot = task_to_iothrottle(current);
+	if (!iot)
+		goto out;
+
+	spin_lock_irq(&iot->lock);
+	iot->req = 0;
+	iot->last_request = jiffies;
+out2:
+	spin_unlock_irq(&iot->lock);
+out:
+	cgroup_unlock();
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c	2008-01-17 12:35:13.000000000 +0100
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
 
 /*
  * for max sense size
@@ -3221,6 +3222,8 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	io_throttle(nr_sectors);
+
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile	2008-01-17 12:35:13.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= io-throttle.o
diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h
--- linux-2.6.24-rc8/include/linux/cgroup_subsys.h	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h	2008-01-17 12:35:13.000000000 +0100
@@ -37,3 +37,9 @@ SUBSYS(cpuacct)
 
 /* */
 
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h	2008-01-17 12:35:13.000000000 +0100
@@ -0,0 +1,10 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void io_throttle(int nr_sectors);
+#else
+static inline void io_throttle(int nr_sectors) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig	2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig	2008-01-17 12:35:13.000000000 +0100
@@ -313,6 +313,15 @@ config CGROUP_NS
           for instance virtual servers and checkpoint/restart
           jobs.
 
+config CGROUP_IO_THROTTLE
+        bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+        depends on EXPERIMENTAL && CGROUPS
+        help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroups.
+
+          Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP && CGROUPS

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2008-01-25  8:46 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-18 22:39 [PATCH] cgroup: limit block I/O bandwidth Naveen Gupta
2008-01-19 11:17 ` Andrea Righi
2008-01-20 13:45   ` Andrea Righi
2008-01-20 14:32     ` Jens Axboe
2008-01-20 14:58       ` Balbir Singh
2008-01-20 15:41       ` Andrea Righi
2008-01-20 16:06         ` Jens Axboe
2008-01-20 23:59           ` Andrea Righi
2008-01-22 19:02             ` Naveen Gupta
2008-01-22 23:11               ` Andrea Righi
2008-01-23  1:17                 ` Naveen Gupta
2008-01-23 15:23                   ` Andrea Righi
2008-01-23 15:38                     ` Balbir Singh
2008-01-23 20:55                       ` Andrea Righi
2008-01-24  9:05                         ` Pavel Emelyanov
2008-01-24 13:48                           ` Andrea Righi
2008-01-24 13:50                             ` Balbir Singh
  -- strict thread matches above, loose matches on Subject: below --
2008-01-18 11:41 Andrea Righi
2008-01-18 12:36 ` Dhaval Giani
2008-01-18 12:41   ` Paul Menage
2008-01-18 13:02     ` Andrea Righi
2008-01-18 15:50 ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).