linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] I/O bandwidth controller and BIO tracking
@ 2008-08-12 12:31 Ryo Tsuruta
  2008-08-12 12:33 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:31 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

Hi everyone,

Here are new releases of dm-ioband and bio-cgroup.

The major change from the previous version is that dm-ioband now supports
a device-mapper bvec merge function, which removes the restriction that
the device-mapper framework automatically splits a I/O request into
several small I/O requests. The size of I/O requests was limited to
PAGE_SIZE when the underlying device, such as software RAID, had its
own merge function. This restriction had ever been applied to all
device-mapper drivers, but it has been solved recently by introducing
the bvec merge function feature into device-mapper.

The release also includes a minor update of bio-cgroup, that removes
some unused code in bio_cgroup_move_task(), which is no longer necessary.

dm-ioband
  Dm-ioband is an I/O bandwidth controller implemented as a
  device-mapper driver, which gives specified bandwidth to each job
  running on the same block device. A job is a group of processes
  with the same pid or pgrp or uid or a virtual machine such as KVM
  or Xen. A job can also be a cgroup by applying the bio-cgroup patch.

bio-cgroup
  Bio-cgroup is a BIO tracking mechanism, which is implemented on the
  cgroup memory subsystem. With the mechanism, it is able to determine
  which cgroup each of bio belongs to, even when the bio is one of
  delayed-write requests issued from a kernel thread such as pdflush.

The following is a list of patches:

  [PATCH 1/7] dm-ioband: Patch of device-mapper driver
  [PATCH 2/7] dm-ioband: Documentation of design overview, installation,
                         command reference and examples.
  [PATCH 3/7] bio-cgroup: Introduction
  [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s
  [PATCH 6/7] bio-cgroup: Implement the bio-cgroup
  [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband

Please see the following site for more information:
  Linux Block I/O Bandwidth Control Project
  http://people.valinux.co.jp/~ryov/bwctl/

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/7] dm-ioband: Patch of device-mapper driver
  2008-08-12 12:31 [PATCH 0/7] I/O bandwidth controller and BIO tracking Ryo Tsuruta
@ 2008-08-12 12:33 ` Ryo Tsuruta
  2008-08-12 12:34   ` [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples Ryo Tsuruta
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:33 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

This is the dm-ioband version 1.5.0 release.

Dm-ioband is an I/O bandwidth controller implemented as a device-mapper
driver, which gives specified bandwidth to each job running on the same
physical device.

- Can be applied to the kernel 2.6.27-rc1-mm1.
- Changes from 1.4.0 (posted on Aug 4, 2008):
  - Support bvec merge function.
    This removes a restriction which limits the maximum I/O size to
    PAGE_SIZE when the underlying device, such as software RAID, has
    its own merge function.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig linux-2.6.27-rc1-mm1/drivers/md/Kconfig
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Kconfig	2008-08-12 14:15:58.000000000 +0900
@@ -275,4 +275,17 @@ config DM_UEVENT
 	---help---
 	Generate udev events for DM events.
 
+config DM_IOBAND
+	tristate "I/O bandwidth control (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	This device-mapper target allows to define how the
+	available bandwidth of a storage device should be
+	shared between processes, cgroups, the partitions or the LUNs.
+
+	Information on how to use dm-ioband is available in:
+	   <file:Documentation/device-mapper/ioband.txt>.
+
+	If unsure, say N.
+
 endif # MD
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile linux-2.6.27-rc1-mm1/drivers/md/Makefile
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Makefile	2008-08-12 14:15:58.000000000 +0900
@@ -7,6 +7,7 @@ dm-mod-objs	:= dm.o dm-table.o dm-target
 dm-multipath-objs := dm-path-selector.o dm-mpath.o
 dm-snapshot-objs := dm-snap.o dm-exception-store.o
 dm-mirror-objs	:= dm-raid1.o
+dm-ioband-objs	:= dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-type.o
 md-mod-objs     := md.o bitmap.o
 raid456-objs	:= raid5.o raid6algos.o raid6recov.o raid6tables.o \
 		   raid6int1.o raid6int2.o raid6int4.o \
@@ -36,6 +37,7 @@ obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipa
 obj-$(CONFIG_DM_SNAPSHOT)	+= dm-snapshot.o
 obj-$(CONFIG_DM_MIRROR)		+= dm-mirror.o dm-log.o
 obj-$(CONFIG_DM_ZERO)		+= dm-zero.o
+obj-$(CONFIG_DM_IOBAND)		+= dm-ioband.o
 
 quiet_cmd_unroll = UNROLL  $@
       cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-ctl.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-ctl.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-ctl.c	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-ctl.c	2008-08-12 14:15:58.000000000 +0900
@@ -0,0 +1,1335 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <taka@valinux.co.jp>
+ *          Ryo Tsuruta <ryov@valinux.co.jp>
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/raid/md.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+#define DM_MSG_PREFIX "ioband"
+#define POLICY_PARAM_START 6
+#define POLICY_PARAM_DELIM "=:,"
+
+static LIST_HEAD(ioband_device_list);
+/* to protect ioband_device_list */
+static DEFINE_SPINLOCK(ioband_devicelist_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, char *, char *);
+static int ioband_group_attach(struct ioband_group *, int, char *);
+static int ioband_group_type_select(struct ioband_group *, char *);
+
+long ioband_debug;	/* just for debugging */
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, char *name,
+						int argc, char **argv)
+{
+	struct policy_type *p;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r;
+
+	for (p = dm_ioband_policy_type; p->p_name; p++) {
+		if (!strcmp(name, p->p_name))
+			break;
+	}
+	if (!p->p_name)
+		return -EINVAL;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (dp->g_policy == p) {
+		/* do nothing if the same policy is already set */
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return 0;
+	}
+
+	suspend_ioband_device(dp, flags, 1);
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		dp->g_group_dtr(gp);
+
+	/* switch to the new policy */
+	dp->g_policy = p;
+	r = p->p_policy_init(dp, argc, argv);
+	if (!dp->g_hold_bio)
+		dp->g_hold_bio = ioband_hold_bio;
+	if (!dp->g_pop_bio)
+		dp->g_pop_bio = ioband_pop_bio;
+
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		dp->g_group_ctr(gp, NULL);
+	resume_ioband_device(dp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static struct ioband_device *alloc_ioband_device(char *name,
+					int io_throttle, int io_limit)
+
+{
+	struct ioband_device *dp, *new;
+	unsigned long flags;
+
+	new = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	spin_lock_irqsave(&ioband_devicelist_lock, flags);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		if (!strcmp(dp->g_name, name)) {
+			dp->g_ref++;
+			spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+			kfree(new);
+			return dp;
+		}
+	}
+
+	/*
+	 * Prepare its own workqueue as generic_make_request() may
+	 * potentially block the workqueue when submitting BIOs.
+	 */
+	new->g_ioband_wq = create_workqueue("kioband");
+	if (!new->g_ioband_wq) {
+		spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+		kfree(new);
+		return NULL;
+	}
+
+	INIT_DELAYED_WORK(&new->g_conductor, ioband_conduct);
+	INIT_LIST_HEAD(&new->g_groups);
+	INIT_LIST_HEAD(&new->g_list);
+	spin_lock_init(&new->g_lock);
+	mutex_init(&new->g_lock_device);
+	bio_list_init(&new->g_urgent_bios);
+	new->g_io_throttle = io_throttle;
+	new->g_io_limit[0] = io_limit;
+	new->g_io_limit[1] = io_limit;
+	new->g_issued[0] = 0;
+	new->g_issued[1] = 0;
+	new->g_blocked = 0;
+	new->g_ref = 1;
+	new->g_flags = 0;
+	strlcpy(new->g_name, name, sizeof(new->g_name));
+	new->g_policy = NULL;
+	new->g_hold_bio = NULL;
+	new->g_pop_bio = NULL;
+	init_waitqueue_head(&new->g_waitq);
+	init_waitqueue_head(&new->g_waitq_suspend);
+	init_waitqueue_head(&new->g_waitq_flush);
+	list_add_tail(&new->g_list, &ioband_device_list);
+
+	spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+	return new;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ioband_devicelist_lock, flags);
+	dp->g_ref--;
+	if (dp->g_ref > 0) {
+		spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+		return;
+	}
+	list_del(&dp->g_list);
+	spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+	destroy_workqueue(dp->g_ioband_wq);
+	kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+						int wait_completion)
+{
+	struct ioband_group *gp;
+
+	if (wait_completion && dp->g_issued[0] + dp->g_issued[1] > 0)
+		return 0;
+	if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+		return 0;
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		if (waitqueue_active(&gp->c_waitq))
+			return 0;
+	return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+				unsigned long flags, int wait_completion)
+{
+	struct ioband_group *gp;
+
+	/* block incoming bios */
+	set_device_suspended(dp);
+
+	/* wake up all blocked processes and go down all ioband groups */
+	wake_up_all(&dp->g_waitq);
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!is_group_down(gp)) {
+			set_group_down(gp);
+			set_group_need_up(gp);
+		}
+		wake_up_all(&gp->c_waitq);
+	}
+
+	/* flush the already mapped bios */
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+
+	/* wait for all processes to wake up and bios to release */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	wait_event_lock_irq(dp->g_waitq_flush,
+			is_ioband_device_flushed(dp, wait_completion),
+			dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+
+	/* go up ioband groups */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (group_need_up(gp)) {
+			clear_group_need_up(gp);
+			clear_group_down(gp);
+		}
+	}
+
+	/* accept incoming bios */
+	wake_up_all(&dp->g_waitq_suspend);
+	clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(
+					struct ioband_group *head, int id)
+{
+	struct rb_node *node = head->c_group_root.rb_node;
+
+	while (node) {
+		struct ioband_group *p =
+			container_of(node, struct ioband_group, c_group_node);
+
+		if (p->c_id == id || id == IOBAND_ID_ANY)
+			return p;
+		node = (id < p->c_id) ? node->rb_left : node->rb_right;
+	}
+	return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root,
+						struct ioband_group *gp)
+{
+	struct rb_node **new = &root->rb_node, *parent = NULL;
+	struct ioband_group *p;
+
+	while (*new) {
+		p = container_of(*new, struct ioband_group, c_group_node);
+		parent = *new;
+		new = (gp->c_id < p->c_id) ?
+					&(*new)->rb_left : &(*new)->rb_right;
+	}
+
+	rb_link_node(&gp->c_group_node, parent, new);
+	rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_group *gp,
+    struct ioband_group *head, struct ioband_device *dp, int id, char *param)
+{
+	unsigned long flags;
+	int r;
+
+	INIT_LIST_HEAD(&gp->c_list);
+	bio_list_init(&gp->c_blocked_bios);
+	bio_list_init(&gp->c_prio_bios);
+	gp->c_id = id;	/* should be verified */
+	gp->c_blocked = 0;
+	gp->c_prio_blocked = 0;
+	memset(gp->c_stat, 0, sizeof(gp->c_stat));
+	init_waitqueue_head(&gp->c_waitq);
+	gp->c_flags = 0;
+	gp->c_group_root = RB_ROOT;
+	gp->c_banddev = dp;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (head && ioband_group_find(head, id)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("ioband_group: id=%d already exists.", id);
+		return -EEXIST;
+	}
+
+	list_add_tail(&gp->c_list, &dp->g_groups);
+
+	r = dp->g_group_ctr(gp, param);
+	if (r) {
+		list_del(&gp->c_list);
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return r;
+	}
+
+	if (head) {
+		ioband_group_add_node(&head->c_group_root, gp);
+		gp->c_dev = head->c_dev;
+		gp->c_target = head->c_target;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+						struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	list_del(&gp->c_list);
+	if (head)
+		rb_erase(&gp->c_group_node, &head->c_group_root);
+	dp->g_group_dtr(gp);
+	kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	while ((group = ioband_group_find(gp, IOBAND_ID_ANY)))
+		ioband_group_release(gp, group);
+	ioband_group_release(NULL, gp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		set_group_down(p);
+		if (suspend) {
+			set_group_suspended(p);
+			dprintk(KERN_ERR "ioband suspend: gp(%p)\n", p);
+		}
+	}
+	set_group_down(head);
+	if (suspend) {
+		set_group_suspended(head);
+		dprintk(KERN_ERR "ioband suspend: gp(%p)\n", head);
+	}
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node;
+							node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		clear_group_down(p);
+		clear_group_suspended(p);
+		dprintk(KERN_ERR "ioband resume: gp(%p)\n", p);
+	}
+	clear_group_down(head);
+	clear_group_suspended(head);
+	dprintk(KERN_ERR "ioband resume: gp(%p)\n", head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int split_string(char *s, long *id, char **v)
+{
+	char *p, *q;
+	int r = 0;
+
+	*id = IOBAND_ID_ANY;
+	p = strsep(&s, POLICY_PARAM_DELIM);
+	q = strsep(&s, POLICY_PARAM_DELIM);
+	if (!q) {
+		*v = p;
+	} else {
+		r = strict_strtol(p, 0, id);
+		*v = q;
+	}
+	return r;
+}
+
+/*
+ * Create a new band device:
+ *   parameters:  <device> <device-group-id> <io_throttle> <io_limit>
+ *     <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct ioband_group *gp;
+	struct ioband_device *dp;
+	struct dm_dev *dev;
+	int io_throttle;
+	int io_limit;
+	int i, r, start;
+	long val, id;
+	char *param;
+
+	if (argc < POLICY_PARAM_START) {
+		ti->error = "Requires " __stringify(POLICY_PARAM_START)
+							" or more arguments";
+		return -EINVAL;
+	}
+
+	if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+		ti->error = "Ioband device name is too long";
+		return -EINVAL;
+	}
+	dprintk(KERN_ERR "ioband_ctr ioband device name:%s\n", argv[1]);
+
+	r = strict_strtol(argv[2], 0, &val);
+	if (r || val < 0) {
+		ti->error = "Invalid io_throttle";
+		return -EINVAL;
+	}
+	io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+	r = strict_strtol(argv[3], 0, &val);
+	if (r || val < 0) {
+		ti->error = "Invalid io_limit";
+		return -EINVAL;
+	}
+	io_limit = val;
+
+	r = dm_get_device(ti, argv[0], 0, ti->len,
+				dm_table_get_mode(ti->table), &dev);
+	if (r) {
+		ti->error = "Device lookup failed";
+		return r;
+	}
+
+	if (io_limit == 0) {
+		struct request_queue *q;
+
+		q = bdev_get_queue(dev->bdev);
+		if (!q) {
+			ti->error = "Can't get queue size";
+			r = -ENXIO;
+			goto release_dm_device;
+		}
+		dprintk(KERN_ERR "ioband_ctr nr_requests:%lu\n",
+							q->nr_requests);
+		io_limit = q->nr_requests;
+	}
+
+	if (io_limit < io_throttle)
+		io_limit = io_throttle;
+	dprintk(KERN_ERR "ioband_ctr io_throttle:%d io_limit:%d\n",
+						io_throttle, io_limit);
+
+	dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+	if (!dp) {
+		ti->error = "Cannot create ioband device";
+		r = -EINVAL;
+		goto release_dm_device;
+	}
+
+	mutex_lock(&dp->g_lock_device);
+	r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+			argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+	if (r) {
+		ti->error = "Invalid policy parameter";
+		goto release_ioband_device;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp) {
+		ti->error = "Cannot allocate memory for ioband group";
+		r = -ENOMEM;
+		goto release_ioband_device;
+	}
+
+	ti->private = gp;
+	gp->c_target = ti;
+	gp->c_dev = dev;
+
+	/* Find a default group parameter */
+	for (start = POLICY_PARAM_START; start < argc; start++)
+		if (argv[start][0] == ':')
+			break;
+	param = (start < argc) ? &argv[start][1] : NULL;
+
+	/* Create a default ioband group */
+	r = ioband_group_init(gp, NULL, dp, IOBAND_ID_ANY, param);
+	if (r) {
+		kfree(gp);
+		ti->error = "Cannot create default ioband group";
+		goto release_ioband_device;
+	}
+
+	r = ioband_group_type_select(gp, argv[4]);
+	if (r) {
+		ti->error = "Cannot set ioband group type";
+		goto release_ioband_group;
+	}
+
+	/* Create sub ioband groups */
+	for (i = start + 1; i < argc; i++) {
+		r = split_string(argv[i], &id, &param);
+		if (r) {
+			ti->error = "Invalid ioband group parameter";
+			goto release_ioband_group;
+		}
+		r = ioband_group_attach(gp, id, param);
+		if (r) {
+			ti->error = "Cannot create ioband group";
+			goto release_ioband_group;
+		}
+	}
+	mutex_unlock(&dp->g_lock_device);
+	return 0;
+
+release_ioband_group:
+	ioband_group_destroy_all(gp);
+release_ioband_device:
+	mutex_unlock(&dp->g_lock_device);
+	release_ioband_device(dp);
+release_dm_device:
+	dm_put_device(ti, dev);
+	return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+
+	mutex_lock(&dp->g_lock_device);
+	ioband_group_stop_all(gp, 0);
+	cancel_delayed_work_sync(&dp->g_conductor);
+	dm_put_device(ti, gp->c_dev);
+	ioband_group_destroy_all(gp);
+	mutex_unlock(&dp->g_lock_device);
+	release_ioband_device(dp);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	/* Todo: The list should be split into a read list and a write list */
+	bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+	return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	/*
+	 * ToDo: A new flag should be added to struct bio, which indicates
+	 * 	it contains urgent I/O requests.
+	 */
+	if (!PageReclaim(page))
+		return 0;
+	if (PageSwapCache(page))
+		return 2;
+	return 1;
+}
+
+static inline void resume_to_accept_bios(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_device_blocked(dp)
+	    && dp->g_blocked < dp->g_io_limit[0]+dp->g_io_limit[1]) {
+		clear_device_blocked(dp);
+		wake_up_all(&dp->g_waitq);
+	}
+	if (is_group_blocked(gp)) {
+		clear_group_blocked(gp);
+		wake_up_all(&gp->c_waitq);
+	}
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_device_blocked(dp))
+		return 1;
+	if (dp->g_blocked >= dp->g_io_limit[0] + dp->g_io_limit[1]) {
+		set_device_blocked(dp);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_group_blocked(gp))
+		return 1;
+	if (dp->g_should_block(gp)) {
+		set_group_blocked(gp);
+		return 1;
+	}
+	return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+		/*
+		 * Kernel threads shouldn't be blocked easily since each of
+		 * them may handle BIOs for several groups on several
+		 * partitions.
+		 */
+		wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+						dp->g_lock, do_nothing());
+	} else {
+		wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+						dp->g_lock, do_nothing());
+	}
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+	return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_issued[bio_data_dir(bio)]++;
+	return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+	return dp->g_issued[0] < dp->g_io_limit[0]
+		|| dp->g_issued[1] < dp->g_io_limit[1];
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked++;
+	if (is_urgent_bio(bio)) {
+		/*
+		 * ToDo:
+		 * When barrier mode is supported, write bios sharing the same
+		 * file system with the currnt one would be all moved
+		 * to g_urgent_bios list.
+		 * You don't have to care about barrier handling if the bio
+		 * is for swapping.
+		 */
+		dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+		bio_list_add(&dp->g_urgent_bios, bio);
+	} else {
+		gp->c_blocked++;
+		dp->g_hold_bio(gp, bio);
+	}
+}
+
+static inline int room_for_bio_rw(struct ioband_device *dp, int direct)
+{
+	return dp->g_issued[direct] < dp->g_io_limit[direct];
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int direct)
+{
+	if (bio_list_empty(&gp->c_prio_bios))
+		set_prio_queue(gp, direct);
+	bio_list_add(&gp->c_prio_bios, bio);
+	gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+	struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		clear_prio_queue(gp);
+
+	if (bio)
+		gp->c_prio_blocked--;
+	return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+		 struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked--;
+	gp->c_blocked--;
+	if (!gp->c_blocked)
+		resume_to_accept_bios(gp);
+	if (should_pushback_bio(gp))
+		bio_list_add(pushback_list, bio);
+	else {
+		int rw = bio_data_dir(bio);
+
+		gp->c_stat[rw].deferred++;
+		gp->c_stat[rw].sectors += bio_sectors(bio);
+		bio_list_add(issue_list, bio);
+	}
+	return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+		struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+	struct bio *bio;
+
+	if (bio_list_empty(&dp->g_urgent_bios))
+		return;
+	while (room_for_bio_rw(dp, 1)) {
+		bio = bio_list_pop(&dp->g_urgent_bios);
+		if (!bio)
+			return;
+		dp->g_blocked--;
+		dp->g_issued[bio_data_dir(bio)]++;
+		bio_list_add(issue_list, bio);
+	}
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+		struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int direct;
+	int ret;
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		return R_OK;
+	direct = prio_queue_direct(gp);
+	while (gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio_rw(dp, direct))
+			return R_OK;
+		bio = pop_prio_bio(gp);
+		if (!bio)
+			return R_OK;
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+		struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int direct;
+	int ret;
+
+	while (gp->c_blocked - gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio(dp))
+			return R_OK;
+		bio = dp->g_pop_bio(gp);
+		if (!bio)
+			return R_OK;
+
+		direct = bio_data_dir(bio);
+		if (!room_for_bio_rw(dp, direct)) {
+			push_prio_bio(gp, bio, direct);
+			continue;
+		}
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+		struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+	int ret = release_prio_bios(gp, issue_list, pushback_list);
+	if (ret)
+		return ret;
+	return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+							struct bio *bio)
+{
+	struct ioband_group *gp;
+
+	if (!head->c_type->t_getid)
+		return head;
+
+	gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+	if (!gp)
+		gp = head;
+	return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+						union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int rw;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	/*
+	 * The device is suspended while some of the ioband device
+	 * configurations are being changed.
+	 */
+	if (is_device_suspended(dp))
+		wait_event_lock_irq(dp->g_waitq_suspend,
+			!is_device_suspended(dp), dp->g_lock, do_nothing());
+
+	gp = ioband_group_get(gp, bio);
+	prevent_burst_bios(gp, bio);
+	if (should_pushback_bio(gp)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return DM_MAPIO_REQUEUE;
+	}
+
+	bio->bi_bdev = gp->c_dev->bdev;
+	bio->bi_sector -= ti->begin;
+	rw = bio_data_dir(bio);
+
+	if (!gp->c_blocked && room_for_bio_rw(dp, rw)) {
+		if (dp->g_can_submit(gp)) {
+			prepare_to_issue(gp, bio);
+			gp->c_stat[rw].immediate++;
+			gp->c_stat[rw].sectors += bio_sectors(bio);
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return DM_MAPIO_REMAPPED;
+		} else if (!dp->g_blocked
+				&& dp->g_issued[0] + dp->g_issued[1] == 0) {
+			dprintk(KERN_ERR "ioband_map: token expired "
+					"gp:%p bio:%p\n", gp, bio);
+			queue_delayed_work(dp->g_ioband_wq,
+							&dp->g_conductor, 1);
+		}
+	}
+	hold_bio(gp, bio);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+	struct ioband_group *best = NULL;
+	int	highest = 0;
+	int	pri;
+
+	/* Todo: The algorithm should be optimized.
+	 *       It would be better to use rbtree.
+	 */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!gp->c_blocked || !room_for_bio(dp))
+			continue;
+		if (gp->c_blocked == gp->c_prio_blocked
+			&& !room_for_bio_rw(dp, prio_queue_direct(gp))) {
+			continue;
+		}
+		pri = dp->g_can_submit(gp);
+		if (pri > highest) {
+			highest = pri;
+			best = gp;
+		}
+	}
+
+	return best;
+}
+
+/*
+ * This function is called right after it becomes able to resubmit BIOs.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+	struct ioband_device *dp =
+		container_of(work, struct ioband_device, g_conductor.work);
+	struct ioband_group *gp = NULL;
+	struct bio *bio;
+	unsigned long flags;
+	struct bio_list issue_list, pushback_list;
+
+	bio_list_init(&issue_list);
+	bio_list_init(&pushback_list);
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	release_urgent_bios(dp, &issue_list, &pushback_list);
+	if (dp->g_blocked) {
+		gp = choose_best_group(dp);
+		if (gp && release_bios(gp, &issue_list, &pushback_list)
+								== R_YIELD)
+			queue_delayed_work(dp->g_ioband_wq,
+							&dp->g_conductor, 0);
+	}
+
+	if (dp->g_blocked && room_for_bio_rw(dp, 0) && room_for_bio_rw(dp, 1) &&
+		bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+		dp->g_restart_bios(dp)) {
+		dprintk(KERN_ERR "ioband_conduct: token expired dp:%p "
+			"issued(%d,%d) g_blocked(%d)\n", dp,
+			 dp->g_issued[0], dp->g_issued[1], dp->g_blocked);
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	}
+
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	while ((bio = bio_list_pop(&issue_list)))
+		generic_make_request(bio);
+	while ((bio = bio_list_pop(&pushback_list)))
+		bio_endio(bio, -EIO);
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+				int error, union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int r = error;
+
+	/*
+	 *  XXX: A new error code for device mapper devices should be used
+	 *       rather than EIO.
+	 */
+	if (error == -EIO && should_pushback_bio(gp)) {
+		/* This ioband device is suspending */
+		r = DM_ENDIO_REQUEUE;
+	}
+	/*
+	 * Todo: The algorithm should be optimized to eliminate the spinlock.
+	 */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	dp->g_issued[bio_data_dir(bio)]--;
+
+	/*
+	 * Todo: It would be better to introduce high/low water marks here
+	 * 	 not to kick the workqueues so often.
+	 */
+	if (dp->g_blocked)
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	else if (is_device_suspended(dp)
+				&& dp->g_issued[0] + dp->g_issued[1] == 0)
+		wake_up_all(&dp->g_waitq_flush);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+
+	mutex_lock(&dp->g_lock_device);
+	ioband_group_stop_all(gp, 1);
+	mutex_unlock(&dp->g_lock_device);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+
+	mutex_lock(&dp->g_lock_device);
+	ioband_group_resume_all(gp);
+	mutex_unlock(&dp->g_lock_device);
+}
+
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group_stat *stat;
+	int i, sz = *szp; /* used in DMEMIT() */
+
+	DMEMIT(" %d", gp->c_id);
+	for (i = 0; i < 2; i++) {
+		stat = &gp->c_stat[i];
+		DMEMIT(" %lu %lu %lu",
+			stat->immediate + stat->deferred, stat->deferred,
+			stat->sectors);
+	}
+	*szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = 0;	/* used in DMEMIT() */
+	unsigned long flags;
+
+	mutex_lock(&dp->g_lock_device);
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		spin_lock_irqsave(&dp->g_lock, flags);
+		DMEMIT("%s", dp->g_name);
+		ioband_group_status(gp, &sz, result, maxlen);
+		for (node = rb_first(&gp->c_group_root); node;
+						node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			ioband_group_status(p, &sz, result, maxlen);
+		}
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		break;
+
+	case STATUSTYPE_TABLE:
+		spin_lock_irqsave(&dp->g_lock, flags);
+		DMEMIT("%s %s %d %d %s %s",
+				gp->c_dev->name, dp->g_name,
+				dp->g_io_throttle, dp->g_io_limit[0],
+				gp->c_type->t_name, dp->g_policy->p_name);
+		dp->g_show(gp, &sz, result, maxlen);
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		break;
+	}
+
+	mutex_unlock(&dp->g_lock_device);
+	return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, char *name)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct group_type *t;
+	unsigned long flags;
+
+	for (t = dm_ioband_group_type; (t->t_name); t++) {
+		if (!strcmp(name, t->t_name))
+			break;
+	}
+	if (!t->t_name) {
+		DMWARN("ioband type select: %s isn't supported.", name);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return -EBUSY;
+	}
+	gp->c_type = t;
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp, char *cmd, char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	char *val_str;
+	long id;
+	unsigned long flags;
+	int r;
+
+	r = split_string(value, &id, &val_str);
+	if (r)
+		return r;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (id != IOBAND_ID_ANY) {
+		gp = ioband_group_find(gp, id);
+		if (!gp) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			DMWARN("ioband_set_param: id=%ld not found.", id);
+			return -EINVAL;
+		}
+	}
+	r = dp->g_set_param(gp, cmd, val_str);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static int ioband_group_attach(struct ioband_group *gp, int id, char *param)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *sub_gp;
+	int r;
+
+	if (id < 0) {
+		DMWARN("ioband_group_attach: invalid id:%d", id);
+		return -EINVAL;
+	}
+	if (!gp->c_type->t_getid) {
+		DMWARN("ioband_group_attach: "
+		       "no ioband group type is specified");
+		return -EINVAL;
+	}
+
+	sub_gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!sub_gp)
+		return -ENOMEM;
+
+	r = ioband_group_init(sub_gp, gp, dp, id, param);
+	if (r < 0) {
+		kfree(sub_gp);
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *gp, int id)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *sub_gp;
+	unsigned long flags;
+
+	if (id < 0) {
+		DMWARN("ioband_group_detach: invalid id:%d", id);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	sub_gp = ioband_group_find(gp, id);
+	if (!sub_gp) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("ioband_group_detach: invalid id:%d", id);
+		return -EINVAL;
+	}
+
+	/*
+	 * Todo: Calling suspend_ioband_device() before releasing the
+	 *       ioband group has a large overhead. Need improvement.
+	 */
+	suspend_ioband_device(dp, flags, 0);
+	ioband_group_release(gp, sub_gp);
+	resume_ioband_device(dp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+/*
+ * Message parameters:
+ *	"policy"      <name>
+ *       ex)
+ *		"policy" "weight"
+ *	"type"        "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ * 	"io_throttle" <value>
+ * 	"io_limit"    <value>
+ *	"attach"      <group id>
+ *	"detach"      <group id>
+ *	"any-command" <group id>:<value>
+ *       ex)
+ *		"weight" 0:<value>
+ *		"token"  24:<value>
+ */
+static int __ioband_message(struct dm_target *ti,
+					unsigned int argc, char **argv)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	long val;
+	int r = 0;
+	unsigned long flags;
+
+	if (argc == 1 && !strcmp(argv[0], "reset")) {
+		spin_lock_irqsave(&dp->g_lock, flags);
+		memset(gp->c_stat, 0, sizeof(gp->c_stat));
+		for (node = rb_first(&gp->c_group_root); node;
+						 node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			memset(p->c_stat, 0, sizeof(p->c_stat));
+		}
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return 0;
+	}
+
+	if (argc != 2) {
+		DMWARN("Unrecognised band message received.");
+		return -EINVAL;
+	}
+	if (!strcmp(argv[0], "debug")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+		ioband_debug = val;
+		return 0;
+	} else if (!strcmp(argv[0], "io_throttle")) {
+		r = strict_strtol(argv[1], 0, &val);
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (r || val < 0 ||
+			val > dp->g_io_limit[0] || val > dp->g_io_limit[1]) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "io_limit")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val == 0) {
+			struct request_queue *q;
+
+			q = bdev_get_queue(gp->c_dev->bdev);
+			if (!q) {
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				return -ENXIO;
+			}
+			val = q->nr_requests;
+		}
+		if (val < dp->g_io_throttle) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_limit[0] = dp->g_io_limit[1] = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "type")) {
+		return ioband_group_type_select(gp, argv[1]);
+	} else if (!strcmp(argv[0], "attach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_attach(gp, val, NULL);
+	} else if (!strcmp(argv[0], "detach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_detach(gp, val);
+	} else if (!strcmp(argv[0], "policy")) {
+		r = policy_init(dp, argv[1], 0, &argv[2]);
+		return r;
+	} else {
+		/* message anycommand <group-id>:<value> */
+		r = ioband_set_param(gp, argv[0], argv[1]);
+		if (r < 0)
+			DMWARN("Unrecognised band message received.");
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	int r;
+
+	mutex_lock(&dp->g_lock_device);
+	r = __ioband_message(ti, argc, argv);
+	mutex_unlock(&dp->g_lock_device);
+	return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+					struct bio_vec *biovec, int max_size)
+{
+	struct ioband_group *gp = ti->private;
+	struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = gp->c_dev->bdev;
+	bvm->bi_sector -= ti->begin;
+
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static struct target_type ioband_target = {
+	.name	     = "ioband",
+	.module      = THIS_MODULE,
+	.version     = {1, 4, 0},
+	.ctr	     = ioband_ctr,
+	.dtr	     = ioband_dtr,
+	.map	     = ioband_map,
+	.end_io	     = ioband_end_io,
+	.presuspend  = ioband_presuspend,
+	.resume	     = ioband_resume,
+	.status	     = ioband_status,
+	.message     = ioband_message,
+	.merge       = ioband_merge,
+};
+
+static int __init dm_ioband_init(void)
+{
+	int r;
+
+	r = dm_register_target(&ioband_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		return r;
+	}
+	return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+	int r;
+
+	r = dm_unregister_target(&ioband_target);
+	if (r < 0)
+		DMERR("unregister failed %d", r);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi <taka@valinux.co.jp>, "
+	      "Ryo Tsuruta <ryov@valinux.co.jp");
+MODULE_LICENSE("GPL");
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-policy.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-policy.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-policy.c	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-policy.c	2008-08-12 14:15:58.000000000 +0900
@@ -0,0 +1,447 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * It is possible to add a new BIO scheduling policy with it.
+ */
+
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT		100
+#define DEFAULT_TOKENPOOL	2048
+#define DEFAULT_BUCKET		2
+#define IOBAND_IOPRIO_BASE	100
+#define TOKEN_BATCH_UNIT	20
+#define PROCEED_THRESHOLD	8
+#define	LOCAL_ACTIVE_RATIO	8
+#define	GLOBAL_ACTIVE_RATIO	16
+#define OVERCOMMIT_RATE		4
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token = gp->c_token;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (allowance) {
+		if (allowance > dp->g_carryover)
+			allowance = dp->g_carryover;
+		token += gp->c_token_initial * allowance;
+	}
+	if (is_group_down(gp))
+		token += gp->c_token_initial * dp->g_carryover * 2;
+
+	return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+	return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active group on the same ioband
+ * device has used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+	struct ioband_group *gp = dp->g_dominant;
+
+	/*
+	 * Don't make a new epoch if the dominant group still has a lot of
+	 * tokens, except when the I/O load is low.
+	 */
+	if (gp) {
+		int iopri = iopriority(gp);
+		if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+			dp->g_issued[0] + dp->g_issued[1] >= dp->g_io_throttle)
+		return 0;
+	}
+
+	dp->g_epoch++;
+	dprintk(KERN_ERR "make_epoch %d --> %d\n",
+						dp->g_epoch-1, dp->g_epoch);
+
+	/* The leftover tokens will be used in the next epoch. */
+	dp->g_token_extra = dp->g_token_left;
+	if (dp->g_token_extra < 0)
+		dp->g_token_extra = 0;
+	dp->g_token_left = dp->g_token_bucket;
+
+	dp->g_expired = NULL;
+	dp->g_dominant = NULL;
+
+	return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (!allowance)
+		return 0;
+	if (allowance > dp->g_carryover)
+		allowance = dp->g_carryover;
+	gp->c_my_epoch = dp->g_epoch;
+	return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance;
+	int delta;
+	int extra;
+
+	if (gp->c_token > 0)
+		return iopriority(gp);
+
+	if (is_group_down(gp)) {
+		gp->c_token = gp->c_token_initial;
+		return iopriority(gp);
+	}
+	allowance = make_epoch(gp);
+	if (!allowance)
+		return 0;
+	/*
+	 * If this group has the right to get tokens for several epochs,
+	 * give all of them to the group here.
+	 */
+	delta = gp->c_token_initial * allowance;
+	dp->g_token_left -= delta;
+	/*
+	 * Give some extra tokens to this group when there have left unused
+	 * tokens on this ioband device from the previous epoch.
+	 */
+	extra = dp->g_token_extra * gp->c_token_initial /
+				 (dp->g_token_bucket - dp->g_token_extra/2);
+	delta += extra;
+	gp->c_token += delta;
+	gp->c_consumed = 0;
+
+	if (gp == dp->g_current)
+		dp->g_yield_mark += delta;
+	dprintk(KERN_ERR "refill token: "
+		"gp:%p token:%d->%d extra(%d) allowance(%d)\n",
+		gp, gp->c_token - delta, gp->c_token, extra, allowance);
+	if (gp->c_token > 0)
+		return iopriority(gp);
+	dprintk(KERN_ERR "refill token: yet empty gp:%p token:%d\n",
+						gp, gp->c_token);
+	return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become negative value, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+		gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+		; /* Do nothing unless this group is really active. */
+	} else if (!dp->g_dominant ||
+			get_token(gp) > get_token(dp->g_dominant)) {
+		/*
+		 * Regard this group as the dominant group on this
+		 * ioband device when it has larger number of tokens
+		 * than those of the previous one.
+		 */
+		dp->g_dominant = gp;
+	}
+	if (dp->g_epoch == gp->c_my_epoch &&
+			gp->c_token > 0 && gp->c_token - count <= 0) {
+		/* Remember the last group which used up its own tokens. */
+		dp->g_expired = gp;
+		if (dp->g_dominant == gp)
+			dp->g_dominant = NULL;
+	}
+
+	if (gp != dp->g_current) {
+		/* This group is the current already. */
+		dp->g_current = gp;
+		dp->g_yield_mark =
+			gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+	}
+	gp->c_token -= count;
+	gp->c_consumed += count;
+	if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+		/*
+		 * Return-value 1 means that this policy requests dm-ioband
+		 * to give a chance to another group to be selected since
+		 * this group has already issued enough amount of I/Os.
+		 */
+		dp->g_current = NULL;
+		return R_YIELD;
+	}
+	/*
+	 * Return-value 0 means that this policy allows dm-ioband to select
+	 * this group to issue I/Os without a break.
+	 */
+	return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+	return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+	return gp->c_blocked >= gp->c_limit;
+}
+
+static void set_weight(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	dp->g_weight_total += (new - gp->c_weight);
+	gp->c_weight = new;
+
+	if (dp->g_weight_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_token = p->c_token_initial = p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_token = p->c_token_initial =
+				dp->g_token_bucket * p->c_weight /
+				dp->g_weight_total + 1;
+			p->c_limit = (dp->g_io_limit[0] + dp->g_io_limit[1]) *
+				p->c_weight / dp->g_weight_total /
+				OVERCOMMIT_RATE + 1;
+		}
+	}
+}
+
+static void init_token_bucket(struct ioband_device *dp, int val)
+{
+	dp->g_token_bucket = ((dp->g_io_limit[0] + dp->g_io_limit[1]) *
+				 DEFAULT_BUCKET) << dp->g_token_unit;
+	if (!val)
+		val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+	else if (val < dp->g_token_bucket)
+		val = dp->g_token_bucket;
+	dp->g_carryover = val/dp->g_token_bucket;
+	dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp, char *cmd, char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	long val;
+	int r = 0, err;
+
+	err = strict_strtol(value, 0, &val);
+	if (!strcmp(cmd, "weight")) {
+		if (!err && 0 < val && val <= SHORT_MAX)
+			set_weight(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "token")) {
+		if (!err && val > 0) {
+			init_token_bucket(dp, val);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "io_limit")) {
+		init_token_bucket(dp, dp->g_token_bucket * dp->g_carryover);
+		set_weight(gp, gp->c_weight);
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, char *arg)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (!arg)
+		arg = __stringify(DEFAULT_WEIGHT);
+	gp->c_my_epoch = dp->g_epoch;
+	gp->c_weight = 0;
+	gp->c_consumed = 0;
+	return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	set_weight(gp, 0);
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp; /* used in DMEMIT() */
+
+	DMEMIT(" %d :%d", dp->g_token_bucket * dp->g_carryover, gp->c_weight);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d", p->c_id, p->c_weight);
+	}
+	*szp = sz;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart to submit BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initalize the policy own members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy own date.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receive a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = is_token_left;
+	dp->g_prepare_bio = prepare_token;
+	dp->g_restart_bios = make_global_epoch;
+	dp->g_group_ctr = policy_weight_ctr;
+	dp->g_group_dtr = policy_weight_dtr;
+	dp->g_set_param = policy_weight_param;
+	dp->g_should_block = is_queue_full;
+	dp->g_show  = policy_weight_show;
+
+	dp->g_epoch = 0;
+	dp->g_weight_total = 0;
+	dp->g_current = NULL;
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+	dp->g_token_extra = 0;
+	dp->g_token_unit = 0;
+	init_token_bucket(dp, val);
+	dp->g_token_left = dp->g_token_bucket;
+
+	return 0;
+}
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int w2_prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+	/* Consume tokens depending on the size of a given bio. */
+	return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int w2_policy_weight_init(struct ioband_device *dp,
+							int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+	}
+
+	r = policy_weight_init(dp, argc, argv);
+	if (r < 0)
+		return r;
+
+	dp->g_prepare_bio = w2_prepare_token;
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_token_bucket(dp, val);
+	dp->g_token_left = dp->g_token_bucket;
+	return 0;
+}
+/* weight balancing policy based on I/O size. --- End --- */
+
+
+static int policy_default_init(struct ioband_device *dp,
+					int argc, char **argv)
+{
+	return policy_weight_init(dp, argc, argv);
+}
+
+struct policy_type dm_ioband_policy_type[] = {
+	{"default", policy_default_init},
+	{"weight", policy_weight_init},
+	{"weight-iosize", w2_policy_weight_init},
+	{NULL,     policy_default_init}
+};
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-type.c	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-type.c	2008-08-12 14:15:58.000000000 +0900
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+	/*
+	 * This function will work for KVM and Xen.
+	 */
+	return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+	return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+	return (int)current->uid;
+}
+
+static int ioband_gid(struct bio *bio)
+{
+	return (int)current->gid;
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+  /*
+   * This function should return the ID of the cgroup which issued "bio".
+   * The ID of the cgroup which the current process belongs to won't be
+   * suitable ID for this purpose, since some BIOs will be handled by kernel
+   * threads like aio or pdflush on behalf of the process requesting the BIOs.
+   */
+	return 0;	/* not implemented yet */
+}
+
+struct group_type dm_ioband_group_type[] = {
+	{"none",   NULL},
+	{"pgrp",   ioband_process_group},
+	{"pid",    ioband_process_id},
+	{"node",   ioband_node},
+	{"cpuset", ioband_cpuset},
+	{"cgroup", ioband_cgroup},
+	{"user",   ioband_uid},
+	{"uid",    ioband_uid},
+	{"gid",    ioband_gid},
+	{NULL,     NULL}
+};
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband.h linux-2.6.27-rc1-mm1/drivers/md/dm-ioband.h
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband.h	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband.h	2008-08-12 14:15:58.000000000 +0900
@@ -0,0 +1,190 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DEFAULT_IO_THROTTLE	4
+#define DEFAULT_IO_LIMIT	128
+#define IOBAND_NAME_MAX 31
+#define IOBAND_ID_ANY (-1)
+
+struct ioband_group;
+
+struct ioband_device {
+	struct list_head	g_groups;
+	struct delayed_work     g_conductor;
+	struct workqueue_struct	*g_ioband_wq;
+	struct	bio_list	g_urgent_bios;
+	int	g_io_throttle;
+	int	g_io_limit[2];
+	int	g_issued[2];
+	int	g_blocked;
+	spinlock_t	g_lock;
+	struct mutex	g_lock_device;
+	wait_queue_head_t g_waitq;
+	wait_queue_head_t g_waitq_suspend;
+	wait_queue_head_t g_waitq_flush;
+
+	int	g_ref;
+	struct	list_head g_list;
+	int	g_flags;
+	char	g_name[IOBAND_NAME_MAX + 1];
+	struct	policy_type *g_policy;
+
+	/* policy dependent */
+	int	(*g_can_submit)(struct ioband_group *);
+	int	(*g_prepare_bio)(struct ioband_group *, struct bio *, int);
+	int	(*g_restart_bios)(struct ioband_device *);
+	void	(*g_hold_bio)(struct ioband_group *, struct bio *);
+	struct bio * (*g_pop_bio)(struct ioband_group *);
+	int	(*g_group_ctr)(struct ioband_group *, char *);
+	void	(*g_group_dtr)(struct ioband_group *);
+	int	(*g_set_param)(struct ioband_group *, char *cmd, char *value);
+	int	(*g_should_block)(struct ioband_group *);
+	void	(*g_show)(struct ioband_group *, int *, char *, unsigned int);
+
+	/* members for weight balancing policy */
+	int	g_epoch;
+	int	g_weight_total;
+		/* the number of tokens which can be used in every epoch */
+	int	g_token_bucket;
+		/* how many epochs tokens can be carried over */
+	int	g_carryover;
+		/* how many tokens should be used for one page-sized I/O */
+	int	g_token_unit;
+		/* the last group which used a token */
+	struct ioband_group *g_current;
+		/* give another group a chance to be scheduled when the rest
+		   of tokens of the current group reaches this mark */
+	int	g_yield_mark;
+		/* the latest group which used up its tokens */
+	struct ioband_group *g_expired;
+		/* the group which has the largest number of tokens in the
+		   active groups */
+	struct ioband_group *g_dominant;
+		/* the number of unused tokens in this epoch */
+	int	g_token_left;
+		/* left-over tokens from the previous epoch */
+	int	g_token_extra;
+};
+
+struct ioband_group_stat {
+	unsigned long	sectors;
+	unsigned long	immediate;
+	unsigned long	deferred;
+};
+
+struct ioband_group {
+	struct	list_head c_list;
+	struct ioband_device *c_banddev;
+	struct dm_dev *c_dev;
+	struct dm_target *c_target;
+	struct	bio_list c_blocked_bios;
+	struct	bio_list c_prio_bios;
+	struct	rb_root c_group_root;
+	struct  rb_node c_group_node;
+	int	c_id;	/* should be unsigned long or unsigned long long */
+	char	c_name[IOBAND_NAME_MAX + 1];	/* rfu */
+	int	c_blocked;
+	int	c_prio_blocked;
+	wait_queue_head_t c_waitq;
+	int	c_flags;
+	struct	ioband_group_stat c_stat[2];	/* hold rd/wr status */
+	struct	group_type *c_type;
+
+	/* members for weight balancing policy */
+	int	c_weight;
+	int	c_my_epoch;
+	int	c_token;
+	int	c_token_initial;
+	int	c_limit;
+	int     c_consumed;
+
+	/* rfu */
+	/* struct bio_list	c_ordered_tag_bios; */
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED		1
+#define DEV_SUSPENDED		2
+
+#define set_device_blocked(dp)		((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp)	((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp)		((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp)	((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp)	((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp)		((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_WRITE	1
+#define IOG_PRIO_QUEUE		2
+#define IOG_BIO_BLOCKED		4
+#define IOG_GOING_DOWN		8
+#define IOG_SUSPENDED		16
+#define IOG_NEED_UP		32
+
+#define R_OK		0
+#define R_BLOCK		1
+#define R_YIELD		2
+
+#define set_group_blocked(gp)		((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp)		((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp)		((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp)		((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp)		((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp)		((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp)		((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp)	((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp)		((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp)		((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp)		((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp)		((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_read(gp)		((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_read(gp)		((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_read(gp) \
+	((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == IOG_PRIO_QUEUE)
+
+#define set_prio_write(gp) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+#define clear_prio_write(gp) \
+	((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+#define is_prio_write(gp) \
+	((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == \
+		(IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+
+#define set_prio_queue(gp, direct) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|direct))
+#define clear_prio_queue(gp)		clear_prio_write(gp)
+#define is_prio_queue(gp)		((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_direct(gp)		((gp)->c_flags & IOG_PRIO_BIO_WRITE)
+
+
+struct policy_type {
+	const char *p_name;
+	int	  (*p_policy_init)(struct ioband_device *, int, char **);
+};
+
+extern struct policy_type dm_ioband_policy_type[];
+
+struct group_type {
+	const char *t_name;
+	int	  (*t_getid)(struct bio *);
+};
+
+extern struct group_type dm_ioband_group_type[];
+
+/* Just for debugging */
+extern long ioband_debug;
+#define dprintk(format, a...) \
+	if (ioband_debug > 0) ioband_debug--, printk(format, ##a)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples
  2008-08-12 12:33 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
@ 2008-08-12 12:34   ` Ryo Tsuruta
  2008-08-12 12:34     ` [PATCH 3/7] bio-cgroup: Introduction Ryo Tsuruta
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:34 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

Here is the documentation of design overview, installation, command
reference and examples.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt	2008-08-12 14:15:58.000000000 +0900
@@ -0,0 +1,937 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same physical device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to its weight, which each job can set its own value to.
+
+     A job is a group of processes with the same pid or pgrp or uid or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the bio-cgroup patch, which can be found at
+   [8]http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+     +------+ +------+ +------+   +------+ +------+ +------+
+     |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+     |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+     +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+     +--V----+---V---+----V---+   +--V----+---V---+----V---+
+     | group | group | default|   | group | group | default|  ioband groups
+     |       |       |  group |   |       |       |  group |
+     +-------+-------+--------+   +-------+-------+--------+
+     |        ioband1         |   |       ioband2          |  ioband devices
+     +-----------|------------+   +-----------|------------+
+     +-----------V--------------+-------------V------------+
+     |                          |                          |
+     |          sdb1            |           sdb2           |  physical devices
+     +--------------------------+--------------------------+
+
+
+   --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+     Dm-ioband is flexible to configure the bandwidth settings.
+
+     Dm-ioband can work with any type of I/O scheduler such as the NOOP
+   scheduler, which is often chosen for high-end storages, since it is
+   implemented outside the I/O scheduling layer. It allows both of partition
+   based bandwidth control and job --- a group of processes --- based
+   control. In addition, it can set different configuration on each physical
+   device to control its bandwidth.
+
+     Meanwhile the current implementation of the CFQ scheduler has 8 IO
+   priority levels and all jobs whose processes have the same IO priority
+   share the bandwidth assigned to this level between them. And IO priority
+   is an attribute of a process so that it equally effects to all block
+   devices.
+
+   --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+     Every ioband device has one ioband group, which by default is called the
+   default group.
+
+     Ioband devices can also have extra ioband groups in them. Each ioband
+   group has a job to support and a weight. Proportional to the weight,
+   dm-ioband gives tokens to the group.
+
+     A group passes on I/O requests that its job issues to the underlying
+   layer so long as it has tokens left, while requests are blocked if there
+   aren't any tokens left in the group. Tokens are refilled once all of
+   groups that have requests on a given physical device use up their tokens.
+
+     There are two policies for token consumption. One is that a token is
+   consumed for each I/O request. The other is that a token is consumed for
+   each I/O sector, for example, one I/O request which consists of
+   4Kbytes(512bytes * 8 sectors) read consumes 8 tokens. A user can choose
+   either policy.
+
+     With this approach, a job running on an ioband group with large weight
+   is guaranteed a wide I/O bandwidth.
+
+   --------------------------------------------------------------------------
+
+Setup and Installation
+
+     Build a kernel with these options enabled:
+
+     CONFIG_MD
+     CONFIG_BLK_DEV_DM
+     CONFIG_DM_IOBAND
+
+
+     If compiled as module, use modprobe to load dm-ioband.
+
+     # make modules
+     # make modules_install
+     # depmod -a
+     # modprobe dm-ioband
+
+
+     "dmsetup targets" command shows all available device-mapper targets.
+   "ioband" is displayed if dm-ioband has been loaded.
+
+     # dmsetup targets
+     ioband           v1.5.0
+
+
+   --------------------------------------------------------------------------
+
+Getting started
+
+     The following is a brief description how to control the I/O bandwidth of
+   disks. In this description, we'll take one disk with two partitions as an
+   example target.
+
+   --------------------------------------------------------------------------
+
+  Create and map ioband devices
+
+     Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+   to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+   and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+   the bandwidth of the physical disk "/dev/sda" while "ioband2" can use 20%.
+
+    # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+        "weight 0 :40" | dmsetup create ioband1
+    # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+        "weight 0 :10" | dmsetup create ioband2
+
+
+     If the commands are successful then the device files
+   "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+   --------------------------------------------------------------------------
+
+  Additional bandwidth control
+
+     In this example two extra ioband groups are created on "ioband1". The
+   first group consists of all the processes with user-id 1000 and the second
+   group consists of all the processes with user-id 2000. Their weights are
+   30 and 20 respectively.
+
+    # dmsetup message ioband1 0 type user
+    # dmsetup message ioband1 0 attach 1000
+    # dmsetup message ioband1 0 attach 2000
+    # dmsetup message ioband1 0 weight 1000:30
+    # dmsetup message ioband1 0 weight 2000:20
+
+
+     Now the processes in the user-id 1000 group can use 30% ---
+   30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
+
+   Table 1. Weight assignments
+
+   +----------------------------------------------------------------+
+   | ioband device |          ioband group          | ioband weight |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 1000                   | 30            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 2000                   | 20            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | default group(the other users) | 40            |
+   |---------------+--------------------------------+---------------|
+   | ioband2       | default group                  | 10            |
+   +----------------------------------------------------------------+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband devices
+
+     Remove the ioband devices when no longer used.
+
+     # dmsetup remove ioband1
+     # dmsetup remove ioband2
+
+
+   --------------------------------------------------------------------------
+
+Command Reference
+
+  Create an ioband device
+
+   SYNOPSIS
+
+           dmsetup create IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Create an ioband device with the given name IOBAND_DEVICE.
+           Generally, dmsetup reads a table from standard input. Each line of
+           the table specifies a single target and is of the form:
+
+             start_sector num_sectors "ioband" device_file ioband_device_id \
+                 io_throttle io_limit ioband_group_type policy token_base \
+                 :weight [ioband_group_id:weight...]
+
+
+                start_sector, num_sectors
+
+                          The sector range of the underlying device where
+                        dm-ioband maps.
+
+                ioband
+
+                          Specify the string "ioband" as a target type.
+
+                device_file
+
+                          Underlying device name.
+
+                ioband_device_id
+
+                          The ID number for an ioband device. The same ID
+                        must be set among the ioband devices that share the
+                        same bandwidth, which means they work on the same
+                        physical disk.
+
+                io_throttle
+
+                          Dm-ioband starts to control the bandwidth when the
+                        number of BIOs in progress exceeds this value. If 0
+                        is specified, dm-ioband uses the default value.
+
+                io_limit
+
+                          Dm-ioband blocks all I/O requests for the
+                        IOBAND_DEVICE when the number of BIOs in progress
+                        exceeds this value. If 0 is specified, dm-ioband uses
+                        the default value.
+
+                ioband_group_type
+
+                          Specify how to evaluate the ioband group ID. The
+                        type must be one of "none", "user", "gid", "pid" or
+                        "pgrp." The type "cgroup" is enabled by applying the
+                        bio-cgroup patch. Specify "none" if you don't need
+                        any ioband groups other than the default ioband
+                        group.
+
+                policy
+
+                          Specify bandwidth control policy. A user can choose
+                        either policy "weight" or "weight-iosize."
+
+                             weight
+
+                                       This policy controls bandwidth
+                                     according to the proportional to the
+                                     weight of each ioband group based on the
+                                     number of I/O requests.
+
+                             weight-iosize
+
+                                       This policy controls bandwidth
+                                     according to the proportional to the
+                                     weight of each ioband group based on the
+                                     number of I/O sectors.
+
+                token_base
+
+                          The number of tokens which specified by token_base
+                        will be distributed to all ioband groups according to
+                        the proportional to the weight of each ioband group.
+                        If 0 is specified, dm-ioband uses the default value.
+
+                ioband_group_id:weight
+
+                          Set the weight of the ioband group specified by
+                        ioband_group_id. If ioband_group_id is omitted, the
+                        weight is assigned to the default ioband group.
+
+   EXAMPLE
+
+             Create an ioband device with the following parameters:
+
+              *   Starting sector = "0"
+
+              *   The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+              *   Target type = "ioband"
+
+              *   Underlying device name = "/dev/sda1"
+
+              *   Ioband device ID = "128"
+
+              *   I/O throttle = "10"
+
+              *   I/O limit = "400"
+
+              *   Ioband group type = "user"
+
+              *   Bandwidth control policy = "weight"
+
+              *   Token base = "2048"
+
+              *   Weight for the default ioband group = "100"
+
+              *   Weight for the ioband group 1000 = "80"
+
+              *   Weight for the ioband group 2000 = "20"
+
+              *   Ioband device name = "ioband1"
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband" \
+               "/dev/sda1 128 10 400 user weight 2048 :100 1000:80 2000:20" \
+               | dmsetup create ioband1
+
+
+             Create two device groups (ID=1,2). The bandwidths of these
+           device groups will be individually controlled.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+               "0 0 none weight 0 :80" | dmsetup create ioband1
+             # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+               "0 0 none weight 0 :20" | dmsetup create ioband2
+             # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+               "0 0 none weight 0 :60" | dmsetup create ioband3
+             # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+               "0 0 none weight 0 :40" | dmsetup create ioband4
+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband device
+
+   SYNOPSIS
+
+           dmsetup remove IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Remove the specified ioband device IOBAND_DEVICE. All the band
+           groups attached to the ioband device are also removed
+           automatically.
+
+   EXAMPLE
+
+             Remove ioband device "ioband1."
+
+             # dmsetup remove ioband1
+
+
+   --------------------------------------------------------------------------
+
+  Set an ioband group type
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 type TYPE
+
+   DESCRIPTION
+
+             Set the ioband group type of the specified ioband device
+           IOBAND_DEVICE. TYPE must be one of "none", "user", "gid", "pid" or
+           "pgrp." The type "cgroup" is enabled by applying the bio-cgroup
+           patch. Once the type is set, new ioband groups can be created on
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the ioband group type of ioband device "ioband1" to "user."
+
+             # dmsetup message ioband1 0 type user
+
+
+   --------------------------------------------------------------------------
+
+  Create an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 attach ID
+
+   DESCRIPTION
+
+             Create an ioband group and attach it to IOBAND_DEVICE. ID
+           specifies user-id, group-id, process-id or process-group-id
+           depending the ioband group type of IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Create an ioband group which consists of all processes with
+           user-id 1000 and attach it to ioband device "ioband1."
+
+             # dmsetup message ioband1 0 type user
+             # dmsetup message ioband1 0 attach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Detach the ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 detach ID
+
+   DESCRIPTION
+
+             Detach the ioband group specified by ID from ioband device
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Detach the ioband group with ID "2000" from ioband device
+           "ioband2."
+
+             # dmsetup message ioband2 0 detach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Set bandwidth control policy
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 policy policy
+
+   DESCRIPTION
+
+             Set bandwidth control policy. This command applies to all ioband
+           devices which have the same ioband device ID as IOBAND_DEVICE. A
+           user can choose either policy "weight" or "weight-iosize."
+
+                weight
+
+                          This policy controls bandwidth according to the
+                        proportional to the weight of each ioband group based
+                        on the number of I/O requests.
+
+                weight-iosize
+
+                          This policy controls bandwidth according to the
+                        proportional to the weight of each ioband group based
+                        on the number of I/O sectors.
+
+   EXAMPLE
+
+             Set bandwidth control policy of ioband devices which have the
+           same ioband device ID as "ioband1" to "weight-iosize."
+
+             # dmsetup message ioband1 0 policy weight-iosize
+
+
+   --------------------------------------------------------------------------
+
+  Set the weight of an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 weight VAL
+
+           dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+   DESCRIPTION
+
+             Set the weight of the ioband group specified by ID. Set the
+           weight of the default ioband group of IOBAND_DEVICE if ID isn't
+           specified.
+
+             The following example means that "ioband1" can use 80% ---
+           40/(40+10)*100 --- of the bandwidth of the physical disk while
+           "ioband2" can use 20%.
+
+             # dmsetup message ioband1 0 weight 40
+             # dmsetup message ioband2 0 weight 10
+
+
+             The following lines have the same effect as the above:
+
+             # dmsetup message ioband1 0 weight 4
+             # dmsetup message ioband2 0 weight 1
+
+
+             VAL must be an integer larger than 0. The default value, which
+           is assigned to newly created ioband groups, is 100.
+
+   EXAMPLE
+
+             Set the weight of the default ioband group of "ioband1" to 40.
+
+             # dmsetup message ioband1 0 weight 40
+
+
+             Set the weight of the ioband group of "ioband1" with ID "1000"
+           to 10.
+
+             # dmsetup message ioband1 0 weight 1000:10
+
+
+   --------------------------------------------------------------------------
+
+  Set the number of tokens
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 token VAL
+
+   DESCRIPTION
+
+             Set the number of tokens to VAL. According to their weight, this
+           number of tokens will be distributed to all the ioband groups on
+           the physical device to which ioband device IOBAND_DEVICE belongs
+           when they use up their tokens.
+
+             VAL must be an integer greater than 0. The default is 2048.
+
+   EXAMPLE
+
+             Set the number of tokens of the physical device to which
+           "ioband1" belongs to 256.
+
+             # dmsetup message ioband1 0 token 256
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O throttling
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+   DESCRIPTION
+
+             Set the I/O throttling value of the physical disk to which
+           ioband device IOBAND_DEVICE belongs to VAL. Dm-ioband start to
+           control the bandwidth when the number of BIOs in progress on the
+           physical disk exceeds this value.
+
+   EXAMPLE
+
+             Set the I/O throttling value of "ioband1" to 16.
+
+             # dmsetup message ioband1 0 io_throttle 16
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O limiting
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+   DESCRIPTION
+
+             Set the I/O limiting value of the physical disk to which ioband
+           device IOBAND_DEVICE belongs to VAL. Dm-ioband will block all I/O
+           requests for the physical device if the number of BIOs in progress
+           on the physical disk exceeds this value.
+
+   EXAMPLE
+
+             Set the I/O limiting value of "ioband1" to 128.
+
+             # dmsetup message ioband1 0 io_limit 128
+
+
+   --------------------------------------------------------------------------
+
+  Display settings
+
+   SYNOPSIS
+
+           dmsetup table --target ioband
+
+   DESCRIPTION
+
+             Display the current table for the ioband device in a format. See
+           "dmsetup create" command for information on the table format.
+
+   EXAMPLE
+
+             The following output shows the current table of "ioband1."
+
+             # dmsetup table --target ioband
+             ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \
+               2048 :100 1000:80 2000:20
+
+
+   --------------------------------------------------------------------------
+
+  Display Statistics
+
+   SYNOPSIS
+
+           dmsetup status --target ioband
+
+   DESCRIPTION
+
+             Display the statistics of all the ioband devices whose target
+           type is "ioband."
+
+             The output format is as below. the first five columns shows:
+
+              *   ioband device name
+
+              *   logical start sector of the device (must be 0)
+
+              *   device size in sectors
+
+              *   target type (must be "ioband")
+
+              *   device group ID
+
+             The remaining columns show the statistics of each ioband group
+           on the band device. Each group uses seven columns for its
+           statistics.
+
+              *   ioband group ID (-1 means default)
+
+              *   total read requests
+
+              *   delayed read requests
+
+              *   total read sectors
+
+              *   total write requests
+
+              *   delayed write requests
+
+              *   total write sectors
+
+   EXAMPLE
+
+             The following output shows the statistics of two ioband devices.
+           Ioband2 only has the default ioband group and ioband1 has three
+           (default, 1001, 1002) ioband groups.
+
+             # dmsetup status
+             ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+             ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+             166 107 472 139 95 352 1002 211 146 520 210 147 504
+
+
+   --------------------------------------------------------------------------
+
+  Reset status counter
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 reset
+
+   DESCRIPTION
+
+             Reset the statistics of ioband device IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Reset the statistics of "ioband1."
+
+             # dmsetup message ioband1 0 reset
+
+
+   --------------------------------------------------------------------------
+
+Examples
+
+  Example #1: Bandwidth control on Partitions
+
+     This example describes how to control the bandwidth with disk
+   partitions. The following diagram illustrates the configuration of this
+   example. You may want to run a database on /dev/mapper/ioband1 and web
+   applications on /dev/mapper/ioband2.
+
+                 /mnt1                        /mnt2            mount points
+                   |                              |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical devices
+     +---------------------------+---------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Create ioband devices with the same device group ID and assign
+       weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+             "none weight 0 :80" | dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+             "none weight 0 :40" | dmsetup create ioband2
+
+
+    2.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #2: Bandwidth control on Logical Volumes
+
+     This example is similar to the example #1 but it uses LVM logical
+   volumes instead of disk partitions. This example shows how to configure
+   ioband devices on two striped logical volumes.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |      /dev/mapper/lv0     | |     /dev/mapper/lv1      | striped logical
+     |                          | |                          | volumes
+     +-------------------------------------------------------+
+     |                          vg0                          | volume group
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/sdb         | |         /dev/sdc         | physical devices
+     +--------------------------+ +--------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Initialize the partitions for use by LVM.
+
+         # pvcreate /dev/sdb
+         # pvcreate /dev/sdc
+
+
+    2.   Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+         # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+    3.   Create two logical volumes in "vg0." The volumes have to be striped.
+
+         # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+         # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+         The rest is the same as the example #1.
+
+    4.   Create ioband devices corresponding to each logical volume and
+       assign weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+            "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+            dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+            "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+            dmsetup create ioband2
+
+
+    5.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #3: Bandwidth control on processes
+
+     This example describes how to control the bandwidth with groups of
+   processes. You may also want to run an additional application on the same
+   machine described in the example #1. This example shows how to add a new
+   ioband group for this application.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +-------------+------------+ +-------------+------------+
+     |          default         | |  user=1000  |   default  | ioband groups
+     |           (80)           | |     (20)    |    (40)    |   (weight)
+     +-------------+------------+ +-------------+------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical device
+     +---------------------------+---------------------------+
+
+
+     The following shows to set up a new ioband group on the machine that is
+   already configured as the example #1. The application will have a weight
+   of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+    1.   Set the type of ioband2 to "user."
+
+         # dmsetup message ioband2 0 type user.
+
+
+    2.   Create a new ioband group on ioband2.
+
+         # dmsetup message ioband2 0 attach 1000
+
+
+    3.   Assign weight of 10 to this newly created ioband group.
+
+         # dmsetup message ioband2 0 weight 1000:20
+
+
+   --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices. The following diagram illustrates the configuration of this
+   example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical device
+     +---------------------------+---------------------------+
+
+
+     The followings shows how to map ioband device "ioband1" and "ioband2" to
+   virtual block device "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on
+   Virtual Machine 2" respectively on the machine configured as the example
+   #1. Add the following lines to the configuration files that are referenced
+   when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+   --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices when Xen blktap devices are used. The following diagram
+   illustrates the configuration of this example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V----------------------------V------------+
+     |                  /dev/mapper/ioband1                  | ioband device
+     +---------------------------+---------------------------+
+     |       default group       |        default group      | ioband groups
+     |           (80)            |            (40)           |    (weight)
+     +-------------|-------------+--------------|------------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |  +----------V----------+      +----------V---------+  |
+     |  |       vm1.img       |      |       vm2.img      |  | disk image files
+     |  +---------------------+      +--------------------+  |
+     |                        /vmdisk                        | mount point
+     +---------------------------|---------------------------+
+                                 |
+     +---------------------------V---------------------------+
+     |                       /dev/sda1                       | physical device
+     +-------------------------------------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Create an ioband device.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+             "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+    2.   Add the following lines to the configuration files that are
+       referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+       Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+    3.   Run the virtual machines.
+
+         # xm create vm1
+         # xm create vm2
+
+
+    4.   Find out the process IDs of the daemons which control the blktap
+       devices.
+
+         # lsof /vmdisk/disk[12].img
+         COMMAND   PID USER   FD   TYPE DEVICE       SIZE  NODE NAME
+         tapdisk 15011 root   11u   REG  253,0 2147483648 48961 /vmdisk/vm1.img
+         tapdisk 15276 root   13u   REG  253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+    5.   Create new ioband groups of pid 15011 and pid 15276, which are
+       process IDs of the tapdisks, and assign weight of 80 and 40 to the
+       groups respectively.
+
+         # dmsetup message ioband1 0 type pid
+         # dmsetup message ioband1 0 attach 15011
+         # dmsetup message ioband1 0 weight 15011:80
+         # dmsetup message ioband1 0 attach 15276
+         # dmsetup message ioband1 0 weight 15276:40

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/7] bio-cgroup: Introduction
  2008-08-12 12:34   ` [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples Ryo Tsuruta
@ 2008-08-12 12:34     ` Ryo Tsuruta
  2008-08-12 12:35       ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Ryo Tsuruta
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:34 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

With this series of bio-cgruop patches, you can determine the owners of any
type of I/Os and it makes dm-ioband -- I/O bandwidth controller --
be able to control the Block I/O bandwidths even when it accepts
delayed write requests.
Dm-ioband can find the owner cgroup of each request.
It is also possible that the other people who work on the I/O
bandwidth throttling use this functionality to control asynchronous
I/Os with a little enhancement.

You have to apply the patch dm-ioband v1.5.0 before applying this series
of patches.

And you have to select the following config options when compiling kernel:
  CONFIG_CGROUPS=y
  CONFIG_CGROUP_BIO=y
And I recommend you should also select the options for cgroup memory
subsystem, because it makes it possible to give some I/O bandwidth
and some memory to a certain cgroup to control delayed write requests
and the processes in the cgroup will be able to make pages dirty only
inside the cgroup even when the given bandwidth is narrow.
  CONFIG_RESOURCE_COUNTERS=y
  CONFIG_CGROUP_MEM_RES_CTLR=y

This code is based on some part of the memory subsystem of cgroup
and I don't think the accuracy and overhead of the subsystem can be ignored
at this time, so we need to keep tuning it up.

 --------------------------------------------------------

The following shows how to use dm-ioband with cgroups.
Please assume that you want make two cgroups, which we call "bio cgroup"
here, to track down block I/Os and assign them to ioband device "ioband1".

First, mount the bio cgroup filesystem.

 # mount -t cgroup -o bio none /cgroup/bio

Then, make new bio cgroups and put some processes in them.

 # mkdir /cgroup/bio/bgroup1
 # mkdir /cgroup/bio/bgroup2
 # echo 1234 > /cgroup/bio/bgroup1/tasks
 # echo 5678 > /cgroup/bio/bgroup1/tasks

Now, check the ID of each bio cgroup which is just created.

 # cat /cgroup/bio/bgroup1/bio.id
   1
 # cat /cgroup/bio/bgroup2/bio.id
   2

Finally, attach the cgroups to "ioband1" and assign them weights.

 # dmsetup message ioband1 0 type cgroup
 # dmsetup message ioband1 0 attach 1
 # dmsetup message ioband1 0 attach 2
 # dmsetup message ioband1 0 weight 1:30
 # dmsetup message ioband1 0 weight 2:60

You can also make use of the dm-ioband administration tool if you want.
The tool will be found here:
http://people.valinux.co.jp/~kaizuka/dm-ioband/iobandctl/manual.html
You can set up the device with the tool as follows.
In this case, you don't need to know the IDs of the cgroups.

 # iobandctl.py group /dev/mapper/ioband1 cgroup /cgroup/bio/bgroup1:30 /cgroup/bio/bgroup2:60

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-12 12:34     ` [PATCH 3/7] bio-cgroup: Introduction Ryo Tsuruta
@ 2008-08-12 12:35       ` Ryo Tsuruta
  2008-08-12 12:36         ` [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s Ryo Tsuruta
  2008-08-18  1:39         ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:35 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

This patch splits the cgroup memory subsystem into two parts.
One is for tracking pages to find out the owners. The other is
for controlling how much amount of memory should be assigned to
each cgroup.

With this patch, you can use the page tracking mechanism even if
the memory subsystem is off.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1.ioband/include/linux/memcontrol.h	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h	2008-08-12 14:47:11.000000000 +0900
@@ -20,12 +20,62 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock.  We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin).  But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT    0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK        (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK        0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct list_head lru;		/* per cgroup LRU list */
+	struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	struct page *page;
+	int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE	(0x2)	/* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
+#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
+
+static inline void lock_page_cgroup(struct page *page)
+{
+	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline int try_lock_page_cgroup(struct page *page)
+{
+	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline void unlock_page_cgroup(struct page *page)
+{
+	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
 
 #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
 
@@ -34,45 +84,15 @@ extern int mem_cgroup_charge(struct page
 				gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
-extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
-
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file);
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
-
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-#define mm_match_cgroup(mm, cgroup)	\
-	((cgroup) == mem_cgroup_from_task((mm)->owner))
 
 extern int
 mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
 extern void mem_cgroup_end_migration(struct page *page);
+extern void page_cgroup_init(void);
 
-/*
- * For memory reclaim.
- */
-extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
-extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
-
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-							int priority);
-
-extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
-					int priority, enum lru_list lru);
-
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
 }
@@ -102,6 +122,53 @@ static inline void mem_cgroup_uncharge_c
 {
 }
 
+static inline int
+mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+{
+	return 0;
+}
+
+static inline void mem_cgroup_end_migration(struct page *page)
+{
+}
+#endif /* CONFIG_CGROUP_PAGE */
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+
+extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
+
+extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
+					struct list_head *dst,
+					unsigned long *scanned, int order,
+					int mode, struct zone *z,
+					struct mem_cgroup *mem_cont,
+					int active, int file);
+extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
+int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+
+extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+
+#define mm_match_cgroup(mm, cgroup)	\
+	((cgroup) == mem_cgroup_from_task((mm)->owner))
+
+/*
+ * For memory reclaim.
+ */
+extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
+extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
+
+extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
+extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
+							int priority);
+extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
+							int priority);
+
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+					int priority, enum lru_list lru);
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+
 static inline int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return 0;
@@ -122,16 +189,6 @@ static inline int task_in_mem_cgroup(str
 	return 1;
 }
 
-static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
-{
-	return 0;
-}
-
-static inline void mem_cgroup_end_migration(struct page *page)
-{
-}
-
 static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
 {
 	return 0;
@@ -163,7 +220,8 @@ static inline long mem_cgroup_calc_recla
 {
 	return 0;
 }
-#endif /* CONFIG_CGROUP_MEM_CONT */
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
 #endif /* _LINUX_MEMCONTROL_H */
 
diff -Ndupr linux-2.6.27-rc1-mm1.ioband/include/linux/mm_types.h linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h
--- linux-2.6.27-rc1-mm1.ioband/include/linux/mm_types.h	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h	2008-08-12 14:47:11.000000000 +0900
@@ -92,7 +92,7 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	unsigned long page_cgroup;
 #endif
 
diff -Ndupr linux-2.6.27-rc1-mm1.ioband/init/Kconfig linux-2.6.27-rc1-mm1.cg0/init/Kconfig
--- linux-2.6.27-rc1-mm1.ioband/init/Kconfig	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/init/Kconfig	2008-08-12 14:47:11.000000000 +0900
@@ -418,6 +418,10 @@ config CGROUP_MEMRLIMIT_CTLR
 	  memory RSS and Page Cache control. Virtual address space control
 	  is provided by this controller.
 
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR
+
 config SYSFS_DEPRECATED
 	bool
 
diff -Ndupr linux-2.6.27-rc1-mm1.ioband/mm/Makefile linux-2.6.27-rc1-mm1.cg0/mm/Makefile
--- linux-2.6.27-rc1-mm1.ioband/mm/Makefile	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/mm/Makefile	2008-08-12 14:47:11.000000000 +0900
@@ -34,5 +34,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
diff -Ndupr linux-2.6.27-rc1-mm1.ioband/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.ioband/mm/memcontrol.c	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c	2008-08-12 14:47:11.000000000 +0900
@@ -36,10 +36,25 @@
 
 #include <asm/uaccess.h>
 
-struct cgroup_subsys mem_cgroup_subsys __read_mostly;
+enum charge_type {
+	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
+	MEM_CGROUP_CHARGE_TYPE_MAPPED,
+	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
+};
+
+static void __mem_cgroup_uncharge_common(struct page *, enum charge_type);
+
 static struct kmem_cache *page_cgroup_cache __read_mostly;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+static inline int mem_cgroup_disabled(void)
+{
+	return mem_cgroup_subsys.disabled;
+}
+
 /*
  * Statistics for memory cgroup.
  */
@@ -136,35 +151,6 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock.  We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin).  But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 	0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK	0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	int flags;
-};
-#define PAGE_CGROUP_FLAG_CACHE	   (0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE    (0x2)	/* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE	   (0x4)	/* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
-
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -175,12 +161,6 @@ static enum zone_type page_cgroup_zid(st
 	return page_zonenum(pc->page);
 }
 
-enum charge_type {
-	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
-	MEM_CGROUP_CHARGE_TYPE_MAPPED,
-	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
-};
-
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -248,37 +228,6 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	VM_BUG_ON(!page_cgroup_locked(page));
-	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -367,7 +316,7 @@ void mem_cgroup_move_lists(struct page *
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 
 	/*
@@ -506,273 +455,6 @@ unsigned long mem_cgroup_isolate_pages(u
 }
 
 /*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
- */
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype,
-				struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
-	unsigned long flags;
-	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup_per_zone *mz;
-
-	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
-	if (unlikely(pc == NULL))
-		goto err;
-
-	/*
-	 * We always charge the cgroup the mm_struct belongs to.
-	 * The mm_struct's mem_cgroup changes on task migration if the
-	 * thread group leader migrates. It's possible that mm is not
-	 * set, if so charge the init_mm (happens for pagecache usage).
-	 */
-	if (likely(!memcg)) {
-		rcu_read_lock();
-		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-		/*
-		 * For every charge from the cgroup, increment reference count
-		 */
-		css_get(&mem->css);
-		rcu_read_unlock();
-	} else {
-		mem = memcg;
-		css_get(&memcg->css);
-	}
-
-	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto out;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto out;
-		}
-	}
-
-	pc->mem_cgroup = mem;
-	pc->page = page;
-	/*
-	 * If a page is accounted as a page cache, insert to inactive list.
-	 * If anon, insert to active list.
-	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
-		pc->flags = PAGE_CGROUP_FLAG_CACHE;
-		if (page_is_file_cache(page))
-			pc->flags |= PAGE_CGROUP_FLAG_FILE;
-		else
-			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-	} else
-		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
-
-	lock_page_cgroup(page);
-	if (unlikely(page_get_page_cgroup(page))) {
-		unlock_page_cgroup(page);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kmem_cache_free(page_cgroup_cache, pc);
-		goto done;
-	}
-	page_assign_page_cgroup(page, pc);
-
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-	unlock_page_cgroup(page);
-done:
-	return 0;
-out:
-	css_put(&mem->css);
-	kmem_cache_free(page_cgroup_cache, pc);
-err:
-	return -ENOMEM;
-}
-
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
-{
-	if (mem_cgroup_subsys.disabled)
-		return 0;
-
-	/*
-	 * If already mapped, we don't have to account.
-	 * If page cache, page->mapping has address_space.
-	 * But page->mapping may have out-of-use anon_vma pointer,
-	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
-	 * is NULL.
-  	 */
-	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
-		return 0;
-	if (unlikely(!mm))
-		mm = &init_mm;
-	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
-}
-
-int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
-{
-	if (mem_cgroup_subsys.disabled)
-		return 0;
-
-	/*
-	 * Corner case handling. This is called from add_to_page_cache()
-	 * in usual. But some FS (shmem) precharges this page before calling it
-	 * and call add_to_page_cache() with GFP_NOWAIT.
-	 *
-	 * For GFP_NOWAIT case, the page may be pre-charged before calling
-	 * add_to_page_cache(). (See shmem.c) check it here and avoid to call
-	 * charge twice. (It works but has to pay a bit larger cost.)
-	 */
-	if (!(gfp_mask & __GFP_WAIT)) {
-		struct page_cgroup *pc;
-
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		if (pc) {
-			VM_BUG_ON(pc->page != page);
-			VM_BUG_ON(!pc->mem_cgroup);
-			unlock_page_cgroup(page);
-			return 0;
-		}
-		unlock_page_cgroup(page);
-	}
-
-	if (unlikely(!mm))
-		mm = &init_mm;
-
-	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static void
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
-	if (mem_cgroup_subsys.disabled)
-		return;
-
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (unlikely(!pc))
-		goto unlock;
-
-	VM_BUG_ON(pc->page != page);
-
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
-		|| page_mapped(page)))
-		goto unlock;
-
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-	page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-
-	mem = pc->mem_cgroup;
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
-	css_put(&mem->css);
-
-	kmem_cache_free(page_cgroup_cache, pc);
-	return;
-unlock:
-	unlock_page_cgroup(page);
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON(page_mapped(page));
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
-}
-
-/*
- * Before starting migration, account against new page.
- */
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup *mem = NULL;
-	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
-	int ret = 0;
-
-	if (mem_cgroup_subsys.disabled)
-		return 0;
-
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc) {
-		mem = pc->mem_cgroup;
-		css_get(&mem->css);
-		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
-			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	}
-	unlock_page_cgroup(page);
-	if (mem) {
-		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
-			ctype, mem);
-		css_put(&mem->css);
-	}
-	return ret;
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
-{
-	/*
-	 * At success, page->mapping is not NULL.
-	 * special rollback care is necessary when
-	 * 1. at migration failure. (newpage->mapping is cleared in this case)
-	 * 2. the newpage was moved but not remapped again because the task
-	 *    exits and the newpage is obsolete. In this case, the new page
-	 *    may be a swapcache. So, we just call mem_cgroup_uncharge_page()
-	 *    always for avoiding mess. The  page_cgroup will be removed if
-	 *    unnecessary. File cache pages is still on radix-tree. Don't
-	 *    care it.
-	 */
-	if (!newpage->mapping)
-		__mem_cgroup_uncharge_common(newpage,
-					 MEM_CGROUP_CHARGE_TYPE_FORCE);
-	else if (PageAnon(newpage))
-		mem_cgroup_uncharge_page(newpage);
-}
-
-/*
  * A call to try to shrink memory usage under specified resource controller.
  * This is typically used for page reclaiming for shmem for reducing side
  * effect of page allocation from shmem, which is used by some mem_cgroup.
@@ -783,7 +465,7 @@ int mem_cgroup_shrink_usage(struct mm_st
 	int progress = 0;
 	int retry = MEM_CGROUP_RECLAIM_RETRIES;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 
 	rcu_read_lock();
@@ -1104,7 +786,7 @@ mem_cgroup_create(struct cgroup_subsys *
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
-		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+		page_cgroup_init();
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
@@ -1188,3 +870,325 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 };
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline int mem_cgroup_disabled(void)
+{
+	return 1;
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline int page_cgroup_locked(struct page *page)
+{
+	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
+{
+	VM_BUG_ON(!page_cgroup_locked(page));
+	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+}
+
+struct page_cgroup *page_get_page_cgroup(struct page *page)
+{
+	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
+}
+
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+				gfp_t gfp_mask, enum charge_type ctype,
+				struct mem_cgroup *memcg)
+{
+	struct page_cgroup *pc;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem;
+	unsigned long flags;
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct mem_cgroup_per_zone *mz;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
+	if (unlikely(pc == NULL))
+		goto err;
+
+	/*
+	 * We always charge the cgroup the mm_struct belongs to.
+	 * The mm_struct's mem_cgroup changes on task migration if the
+	 * thread group leader migrates. It's possible that mm is not
+	 * set, if so charge the init_mm (happens for pagecache usage).
+	 */
+	if (likely(!memcg)) {
+		rcu_read_lock();
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		/*
+		 * For every charge from the cgroup, increment reference count
+		 */
+		css_get(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		rcu_read_unlock();
+	} else {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+		mem = memcg;
+		css_get(&memcg->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+		if (!(gfp_mask & __GFP_WAIT))
+			goto out;
+
+		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+			continue;
+
+		/*
+		 * try_to_free_mem_cgroup_pages() might not give us a full
+		 * picture of reclaim. Some pages are reclaimed and might be
+		 * moved to swap cache or just unmapped from the cgroup.
+		 * Check the limit again to see if the reclaim reduced the
+		 * current usage of the cgroup before giving up
+		 */
+		if (res_counter_check_under_limit(&mem->res))
+			continue;
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			goto out;
+		}
+	}
+	pc->mem_cgroup = mem;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	pc->page = page;
+	/*
+	 * If a page is accounted as a page cache, insert to inactive list.
+	 * If anon, insert to active list.
+	 */
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
+		pc->flags = PAGE_CGROUP_FLAG_CACHE;
+		if (page_is_file_cache(page))
+			pc->flags |= PAGE_CGROUP_FLAG_FILE;
+		else
+			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+	} else
+		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+
+	lock_page_cgroup(page);
+	if (unlikely(page_get_page_cgroup(page))) {
+		unlock_page_cgroup(page);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		kmem_cache_free(page_cgroup_cache, pc);
+		goto done;
+	}
+	page_assign_page_cgroup(page, pc);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	mz = page_cgroup_zoneinfo(pc);
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_add_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	unlock_page_cgroup(page);
+done:
+	return 0;
+out:
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	kmem_cache_free(page_cgroup_cache, pc);
+err:
+	return -ENOMEM;
+}
+
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	/*
+	 * If already mapped, we don't have to account.
+	 * If page cache, page->mapping has address_space.
+	 * But page->mapping may have out-of-use anon_vma pointer,
+	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
+	 * is NULL.
+	 */
+	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+		return 0;
+	if (unlikely(!mm))
+		mm = &init_mm;
+	return mem_cgroup_charge_common(page, mm, gfp_mask,
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+				gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	/*
+	 * Corner case handling. This is called from add_to_page_cache()
+	 * in usual. But some FS (shmem) precharges this page before calling it
+	 * and call add_to_page_cache() with GFP_NOWAIT.
+	 *
+	 * For GFP_NOWAIT case, the page may be pre-charged before calling
+	 * add_to_page_cache(). (See shmem.c) check it here and avoid to call
+	 * charge twice. (It works but has to pay a bit larger cost.)
+	 */
+	if (!(gfp_mask & __GFP_WAIT)) {
+		struct page_cgroup *pc;
+
+		lock_page_cgroup(page);
+		pc = page_get_page_cgroup(page);
+		if (pc) {
+			VM_BUG_ON(pc->page != page);
+			VM_BUG_ON(!pc->mem_cgroup);
+			unlock_page_cgroup(page);
+			return 0;
+		}
+		unlock_page_cgroup(page);
+	}
+
+	if (unlikely(!mm))
+		mm = &init_mm;
+
+	return mem_cgroup_charge_common(page, mm, gfp_mask,
+				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+}
+
+/*
+ * uncharge if !page_mapped(page)
+ */
+static void
+__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
+{
+	struct page_cgroup *pc;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem;
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/*
+	 * Check if our page_cgroup is valid
+	 */
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (unlikely(!pc))
+		goto unlock;
+
+	VM_BUG_ON(pc->page != page);
+
+	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
+		|| page_mapped(page)
+		|| PageSwapCache(page)))
+		goto unlock;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	mz = page_cgroup_zoneinfo(pc);
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_remove_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	page_assign_page_cgroup(page, NULL);
+	unlock_page_cgroup(page);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	mem = pc->mem_cgroup;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+	kmem_cache_free(page_cgroup_cache, pc);
+	return;
+unlock:
+	unlock_page_cgroup(page);
+}
+
+void mem_cgroup_uncharge_page(struct page *page)
+{
+	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_uncharge_cache_page(struct page *page)
+{
+	VM_BUG_ON(page_mapped(page));
+	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
+}
+
+/*
+ * Before starting migration, account against new page.
+ */
+int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem = NULL;
+	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+	int ret = 0;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (pc) {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+		mem = pc->mem_cgroup;
+		css_get(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+	}
+	unlock_page_cgroup(page);
+	if (mem) {
+		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
+			ctype, mem);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+		css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	}
+	return ret;
+}
+
+/* remove redundant charge if migration failed*/
+void mem_cgroup_end_migration(struct page *newpage)
+{
+	/*
+	 * At success, page->mapping is not NULL.
+	 * special rollback care is necessary when
+	 * 1. at migration failure. (newpage->mapping is cleared in this case)
+	 * 2. the newpage was moved but not remapped again because the task
+	 *    exits and the newpage is obsolete. In this case, the new page
+	 *    may be a swapcache. So, we just call mem_cgroup_uncharge_page()
+	 *    always for avoiding mess. The  page_cgroup will be removed if
+	 *    unnecessary. File cache pages is still on radix-tree. Don't
+	 *    care it.
+	 */
+	if (!newpage->mapping)
+		__mem_cgroup_uncharge_common(newpage,
+					 MEM_CGROUP_CHARGE_TYPE_FORCE);
+	else if (PageAnon(newpage))
+		mem_cgroup_uncharge_page(newpage);
+}
+
+void page_cgroup_init()
+{
+	if (!page_cgroup_cache)
+		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s
  2008-08-12 12:35       ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Ryo Tsuruta
@ 2008-08-12 12:36         ` Ryo Tsuruta
  2008-08-12 12:36           ` [PATCH 6/7] bio-cgroup: Implement the bio-cgroup Ryo Tsuruta
  2008-08-18  1:39         ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:36 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

This patch is for cleaning up the code of the cgroup memory subsystem
to remove a lot of "#ifdef"s.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c	2008-08-12 14:47:11.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c	2008-08-12 14:47:40.000000000 +0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+					struct mem_cgroup *mem)
+{
+	pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	pc->mem_cgroup = NULL;
+	put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	css_get(&mem->css);
+	return mem;
+}
+
+/* This sould be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	get_mem_cgroup(mem);
+	return mem;
+}
+
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
 	list_move(&pc->lru, &mz->lists[lru]);
 }
 
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_add_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_remove_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
@@ -339,6 +400,36 @@ void mem_cgroup_move_lists(struct page *
 	unlock_page_cgroup(page);
 }
 
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+						gfp_t gfp_mask)
+{
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+
+	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+		if (!(gfp_mask & __GFP_WAIT))
+			return -1;
+
+		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+			continue;
+
+		/*
+		 * try_to_free_mem_cgroup_pages() might not give us a full
+		 * picture of reclaim. Some pages are reclaimed and might be
+		 * moved to swap cache or just unmapped from the cgroup.
+		 * Check the limit again to see if the reclaim reduced the
+		 * current usage of the cgroup before giving up
+		 */
+		if (res_counter_check_under_limit(&mem->res))
+			continue;
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			return -1;
+		}
+	}
+	return 0;
+}
+
 /*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for deteremining we have to reclaim mapped pages.
@@ -469,15 +560,14 @@ int mem_cgroup_shrink_usage(struct mm_st
 		return 0;
 
 	rcu_read_lock();
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	css_get(&mem->css);
+	mem = mm_get_mem_cgroup(mm);
 	rcu_read_unlock();
 
 	do {
 		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
 	} while (!progress && --retry);
 
-	css_put(&mem->css);
+	put_mem_cgroup(mem);
 	if (!retry)
 		return -ENOMEM;
 	return 0;
@@ -558,7 +648,7 @@ static int mem_cgroup_force_empty(struct
 	int ret = -EBUSY;
 	int node, zid;
 
-	css_get(&mem->css);
+	get_mem_cgroup(mem);
 	/*
 	 * page reclaim code (kswapd etc..) will move pages between
 	 * active_list <-> inactive_list while we don't take a lock.
@@ -578,7 +668,7 @@ static int mem_cgroup_force_empty(struct
 	}
 	ret = 0;
 out:
-	css_put(&mem->css);
+	put_mem_cgroup(mem);
 	return ret;
 }
 
@@ -873,10 +963,37 @@ struct cgroup_subsys mem_cgroup_subsys =
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 
+struct mem_cgroup;
+
 static inline int mem_cgroup_disabled(void)
 {
 	return 1;
 }
+
+static inline void mem_cgroup_add_page(struct page_cgroup *pc) {}
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc) {}
+static inline void get_mem_cgroup(struct mem_cgroup *mem) {}
+static inline void put_mem_cgroup(struct mem_cgroup *mem) {}
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+					struct mem_cgroup *mem) {}
+static inline void clear_mem_cgroup(struct page_cgroup *pc) {}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+	return NULL;
+}
+
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	return NULL;
+}
+
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+					gfp_t gfp_mask)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
 static inline int page_cgroup_locked(struct page *page)
@@ -906,12 +1023,7 @@ static int mem_cgroup_charge_common(stru
 				struct mem_cgroup *memcg)
 {
 	struct page_cgroup *pc;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	struct mem_cgroup *mem;
-	unsigned long flags;
-	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup_per_zone *mz;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
 	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
 	if (unlikely(pc == NULL))
@@ -925,47 +1037,16 @@ static int mem_cgroup_charge_common(stru
 	 */
 	if (likely(!memcg)) {
 		rcu_read_lock();
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-		/*
-		 * For every charge from the cgroup, increment reference count
-		 */
-		css_get(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		mem = mm_get_mem_cgroup(mm);
 		rcu_read_unlock();
 	} else {
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
 		mem = memcg;
-		css_get(&memcg->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
-	}
-
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto out;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto out;
-		}
+		get_mem_cgroup(mem);
 	}
-	pc->mem_cgroup = mem;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
+	if (mem_cgroup_try_to_allocate(mem, gfp_mask) < 0)
+		goto out;
+	set_mem_cgroup(pc, mem);
 	pc->page = page;
 	/*
 	 * If a page is accounted as a page cache, insert to inactive list.
@@ -983,29 +1064,19 @@ static int mem_cgroup_charge_common(stru
 	lock_page_cgroup(page);
 	if (unlikely(page_get_page_cgroup(page))) {
 		unlock_page_cgroup(page);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		clear_mem_cgroup(pc);
 		kmem_cache_free(page_cgroup_cache, pc);
 		goto done;
 	}
 	page_assign_page_cgroup(page, pc);
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	mem_cgroup_add_page(pc);
 
 	unlock_page_cgroup(page);
 done:
 	return 0;
 out:
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	put_mem_cgroup(mem);
 	kmem_cache_free(page_cgroup_cache, pc);
 err:
 	return -ENOMEM;
@@ -1074,11 +1145,6 @@ static void
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
 	struct page_cgroup *pc;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	struct mem_cgroup *mem;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
 	if (mem_cgroup_disabled())
 		return;
@@ -1099,21 +1165,12 @@ __mem_cgroup_uncharge_common(struct page
 		|| PageSwapCache(page)))
 		goto unlock;
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	mem_cgroup_remove_page(pc);
 
 	page_assign_page_cgroup(page, NULL);
 	unlock_page_cgroup(page);
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	mem = pc->mem_cgroup;
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
-	css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	clear_mem_cgroup(pc);
 
 	kmem_cache_free(page_cgroup_cache, pc);
 	return;
@@ -1148,10 +1205,7 @@ int mem_cgroup_prepare_migration(struct 
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
 	if (pc) {
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-		mem = pc->mem_cgroup;
-		css_get(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		mem = get_mem_page_cgroup(pc);
 		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
 			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	}
@@ -1159,9 +1213,7 @@ int mem_cgroup_prepare_migration(struct 
 	if (mem) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
 			ctype, mem);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-		css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+		put_mem_cgroup(mem);
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 6/7] bio-cgroup: Implement the bio-cgroup
  2008-08-12 12:36         ` [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s Ryo Tsuruta
@ 2008-08-12 12:36           ` Ryo Tsuruta
  2008-08-12 12:37             ` [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband Ryo Tsuruta
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:36 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

This patch implements the bio cgroup on the memory cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c	2008-08-12 14:47:59.000000000 +0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h	2008-08-12 14:48:00.000000000 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+
+#ifndef _LINUX_BIOCONTROL_H
+#define _LINUX_BIOCONTROL_H
+
+#ifdef	CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+	struct cgroup_subsys_state css;
+	int id;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+	spinlock_t		page_list_lock;
+	struct list_head	page_list;
+};
+
+static inline int bio_cgroup_disabled(void)
+{
+	return bio_cgroup_subsys.disabled;
+}
+
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+				struct bio_cgroup, css);
+}
+
+static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	list_add(&pc->blist, &biog->page_list);
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	unsigned long flags;
+	spin_lock_irqsave(&biog->page_list_lock, flags);
+	__bio_cgroup_add_page(pc);
+	spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	list_del_init(&pc->blist);
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	unsigned long flags;
+	spin_lock_irqsave(&biog->page_list_lock, flags);
+	__bio_cgroup_remove_page(pc);
+	spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+	css_get(&biog->css);
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+	css_put(&biog->css);
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+					struct bio_cgroup *biog)
+{
+	pc->bio_cgroup = biog;
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	pc->bio_cgroup = NULL;
+	put_bio_cgroup(biog);
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	css_get(&biog->css);
+	return biog;
+}
+
+/* This sould be called in an RCU-protected section. */
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+	struct bio_cgroup *biog;
+	biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+	get_bio_cgroup(biog);
+	return biog;
+}
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+
+#else	/* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline int bio_cgroup_disabled(void)
+{
+	return 1;
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+					struct bio_cgroup *biog)
+{
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+	return NULL;
+}
+
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+	return NULL;
+}
+
+static inline int get_bio_cgroup_id(struct page *page)
+{
+	return 0;
+}
+
+static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+#endif	/* CONFIG_CGROUP_BIO */
+
+#endif /* _LINUX_BIOCONTROL_H */
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/cgroup_subsys.h linux-2.6.27-rc1-mm1.cg2/include/linux/cgroup_subsys.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/cgroup_subsys.h	2008-08-12 14:30:19.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/cgroup_subsys.h	2008-08-12 14:48:00.000000000 +0900
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BIO
+SUBSYS(bio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/iocontext.h linux-2.6.27-rc1-mm1.cg2/include/linux/iocontext.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/iocontext.h	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/iocontext.h	2008-08-12 14:48:00.000000000 +0900
@@ -83,6 +83,8 @@ struct io_context {
 	struct radix_tree_root radix_root;
 	struct hlist_head cic_list;
 	void *ioc_data;
+
+	int id;		/* cgroup ID */
 };
 
 static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -104,6 +106,7 @@ int put_io_context(struct io_context *io
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/memcontrol.h	2008-08-12 14:47:22.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/memcontrol.h	2008-08-12 14:48:00.000000000 +0900
@@ -54,6 +54,10 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct mem_cgroup *mem_cgroup;
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+#ifdef CONFIG_CGROUP_BIO
+	struct list_head blist;		/* for bio_cgroup page list */
+	struct bio_cgroup *bio_cgroup;
+#endif
 	struct page *page;
 	int flags;
 };
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/init/Kconfig linux-2.6.27-rc1-mm1.cg2/init/Kconfig
--- linux-2.6.27-rc1-mm1.cg1/init/Kconfig	2008-08-12 14:47:22.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/init/Kconfig	2008-08-12 14:48:00.000000000 +0900
@@ -418,9 +418,20 @@ config CGROUP_MEMRLIMIT_CTLR
 	  memory RSS and Page Cache control. Virtual address space control
 	  is provided by this controller.
 
+config CGROUP_BIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS
+	select MM_OWNER
+	help
+	  Provides a Resource Controller which enables to track the onwner
+	  of every Block I/O.
+	  The information this subsystem provides can be used from any
+	  kind of module such as dm-ioband device mapper modules or
+	  the cfq-scheduler.
+
 config CGROUP_PAGE
 	def_bool y
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
 
 config SYSFS_DEPRECATED
 	bool
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/Makefile linux-2.6.27-rc1-mm1.cg2/mm/Makefile
--- linux-2.6.27-rc1-mm1.cg1/mm/Makefile	2008-08-12 14:47:22.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/Makefile	2008-08-12 14:48:00.000000000 +0900
@@ -35,4 +35,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
+obj-$(CONFIG_CGROUP_BIO) += biocontrol.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/biocontrol.c linux-2.6.27-rc1-mm1.cg2/mm/biocontrol.c
--- linux-2.6.27-rc1-mm1.cg1/mm/biocontrol.c	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/biocontrol.c	2008-08-12 15:06:12.000000000 +0900
@@ -0,0 +1,219 @@
+/* biocontrol.c - Block I/O Controller
+ *
+ * Copyright IBM Corporation, 2007
+ * Author Balbir Singh <balbir@linux.vnet.ibm.com>
+ *
+ * Copyright 2007 OpenVZ SWsoft Inc
+ * Author: Pavel Emelianov <xemul@openvz.org>
+ *
+ * Copyright VA Linux Systems Japan, 2008
+ * Author Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/blkdev.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
+#include <linux/err.h>
+#include <linux/biocontrol.h>
+
+/* return corresponding bio_cgroup object of a cgroup */
+static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
+			    struct bio_cgroup, css);
+}
+
+static struct idr bio_cgroup_id;
+static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
+
+static struct cgroup_subsys_state *
+bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct bio_cgroup *biog;
+	struct io_context *ioc;
+	int error;
+
+	if (!cgrp->parent) {
+		static struct bio_cgroup default_bio_cgroup;
+		static struct io_context default_bio_io_context;
+
+		biog = &default_bio_cgroup;
+		ioc = &default_bio_io_context;
+		init_io_context(ioc);
+
+		idr_init(&bio_cgroup_id);
+		biog->id = 0;
+
+		page_cgroup_init();
+	} else {
+		biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+		ioc = alloc_io_context(GFP_KERNEL, -1);
+		if (!ioc || !biog) {
+			error = -ENOMEM;
+			goto out;
+		}
+retry:
+		if (unlikely(!idr_pre_get(&bio_cgroup_id, GFP_KERNEL))) {
+			error = -EAGAIN;
+			goto out;
+		}
+		spin_lock_irq(&bio_cgroup_idr_lock);
+		error = idr_get_new_above(&bio_cgroup_id,
+						(void *)biog, 1, &biog->id);
+		spin_unlock_irq(&bio_cgroup_idr_lock);
+		if (error == -EAGAIN)
+			goto retry;
+		else if (error)
+			goto out;
+	}
+
+	ioc->id = biog->id;
+	biog->io_context = ioc;
+
+	INIT_LIST_HEAD(&biog->page_list);
+	spin_lock_init(&biog->page_list_lock);
+
+	/* Bind the cgroup to bio_cgroup object we just created */
+	biog->css.cgroup = cgrp;
+
+	return &biog->css;
+out:
+	if (ioc)
+		put_io_context(ioc);
+	kfree(biog);
+	return ERR_PTR(error);
+}
+
+#define FORCE_UNCHARGE_BATCH	(128)
+static void bio_cgroup_force_empty(struct bio_cgroup *biog)
+{
+	struct page_cgroup *pc;
+	struct page *page;
+	int count = FORCE_UNCHARGE_BATCH;
+	struct list_head *list = &biog->page_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&biog->page_list_lock, flags);
+	while (!list_empty(list)) {
+		pc = list_entry(list->prev, struct page_cgroup, blist);
+		page = pc->page;
+		get_page(page);
+		spin_unlock_irqrestore(&biog->page_list_lock, flags);
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		if (--count <= 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&biog->page_list_lock, flags);
+	}
+	spin_unlock_irqrestore(&biog->page_list_lock, flags);
+	return;
+}
+
+static void bio_cgroup_pre_destroy(struct cgroup_subsys *ss,
+							struct cgroup *cgrp)
+{
+	struct bio_cgroup *biog = cgroup_bio(cgrp);
+	bio_cgroup_force_empty(biog);
+}
+
+static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+	put_io_context(biog->io_context);
+
+	spin_lock_irq(&bio_cgroup_idr_lock);
+	idr_remove(&bio_cgroup_id, biog->id);
+	spin_unlock_irq(&bio_cgroup_idr_lock);
+
+	kfree(biog);
+}
+
+struct bio_cgroup *find_bio_cgroup(int id)
+{
+	struct bio_cgroup *biog;
+	spin_lock_irq(&bio_cgroup_idr_lock);
+	biog = (struct bio_cgroup *)
+	idr_find(&bio_cgroup_id, id);
+	spin_unlock_irq(&bio_cgroup_idr_lock);
+	get_bio_cgroup(biog);
+	return biog;
+}
+
+struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+	struct io_context *ioc;
+	struct page_cgroup *pc;
+	struct bio_cgroup *biog;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (pc)
+		biog = pc->bio_cgroup;
+	else
+		biog = bio_cgroup_from_task(rcu_dereference(init_mm.owner));
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_inc(&ioc->refcount);
+	unlock_page_cgroup(page);
+	return ioc;
+}
+EXPORT_SYMBOL(get_bio_cgroup_iocontext);
+
+static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+	return (u64) biog->id;
+}
+
+
+static struct cftype bio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = bio_id_read,
+	},
+};
+
+static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	if (bio_cgroup_disabled())
+		return 0;
+	return cgroup_add_files(cont, ss, bio_files, ARRAY_SIZE(bio_files));
+}
+
+static void bio_cgroup_move_task(struct cgroup_subsys *ss,
+				struct cgroup *cont,
+				struct cgroup *old_cont,
+				struct task_struct *p)
+{
+	/* do nothing */
+}
+
+
+struct cgroup_subsys bio_cgroup_subsys = {
+	.name           = "bio",
+	.subsys_id      = bio_cgroup_subsys_id,
+	.create         = bio_cgroup_create,
+	.destroy        = bio_cgroup_destroy,
+	.pre_destroy	= bio_cgroup_pre_destroy,
+	.populate       = bio_cgroup_populate,
+	.attach		= bio_cgroup_move_task,
+	.early_init	= 0,
+};
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg2/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c	2008-08-12 14:47:40.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/memcontrol.c	2008-08-12 14:48:00.000000000 +0900
@@ -20,6 +20,7 @@
 #include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
+#include <linux/biocontrol.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -1019,11 +1020,12 @@ struct page_cgroup *page_get_page_cgroup
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype,
-				struct mem_cgroup *memcg)
+			gfp_t gfp_mask, enum charge_type ctype,
+			struct mem_cgroup *memcg, struct bio_cgroup *biocg)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem;
+	struct bio_cgroup *biog;
 
 	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
 	if (unlikely(pc == NULL))
@@ -1035,18 +1037,15 @@ static int mem_cgroup_charge_common(stru
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
-	if (likely(!memcg)) {
-		rcu_read_lock();
-		mem = mm_get_mem_cgroup(mm);
-		rcu_read_unlock();
-	} else {
-		mem = memcg;
-		get_mem_cgroup(mem);
-	}
+	rcu_read_lock();
+	mem = memcg ? memcg : mm_get_mem_cgroup(mm);
+	biog = biocg ? biocg : mm_get_bio_cgroup(mm);
+	rcu_read_unlock();
 
 	if (mem_cgroup_try_to_allocate(mem, gfp_mask) < 0)
 		goto out;
 	set_mem_cgroup(pc, mem);
+	set_bio_cgroup(pc, biog);
 	pc->page = page;
 	/*
 	 * If a page is accounted as a page cache, insert to inactive list.
@@ -1065,18 +1064,21 @@ static int mem_cgroup_charge_common(stru
 	if (unlikely(page_get_page_cgroup(page))) {
 		unlock_page_cgroup(page);
 		clear_mem_cgroup(pc);
+		clear_bio_cgroup(pc);
 		kmem_cache_free(page_cgroup_cache, pc);
 		goto done;
 	}
 	page_assign_page_cgroup(page, pc);
 
 	mem_cgroup_add_page(pc);
+	bio_cgroup_add_page(pc);
 
 	unlock_page_cgroup(page);
 done:
 	return 0;
 out:
 	put_mem_cgroup(mem);
+	put_bio_cgroup(biog);
 	kmem_cache_free(page_cgroup_cache, pc);
 err:
 	return -ENOMEM;
@@ -1099,7 +1101,7 @@ int mem_cgroup_charge(struct page *page,
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL, NULL);
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -1135,7 +1137,7 @@ int mem_cgroup_cache_charge(struct page 
 		mm = &init_mm;
 
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL, NULL);
 }
 
 /*
@@ -1146,7 +1148,7 @@ __mem_cgroup_uncharge_common(struct page
 {
 	struct page_cgroup *pc;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && bio_cgroup_disabled())
 		return;
 
 	/*
@@ -1166,11 +1168,13 @@ __mem_cgroup_uncharge_common(struct page
 		goto unlock;
 
 	mem_cgroup_remove_page(pc);
+	bio_cgroup_remove_page(pc);
 
 	page_assign_page_cgroup(page, NULL);
 	unlock_page_cgroup(page);
 
 	clear_mem_cgroup(pc);
+	clear_bio_cgroup(pc);
 
 	kmem_cache_free(page_cgroup_cache, pc);
 	return;
@@ -1196,24 +1200,29 @@ int mem_cgroup_prepare_migration(struct 
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	struct bio_cgroup *biog = NULL;
 	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
 	int ret = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && bio_cgroup_disabled())
 		return 0;
 
 	lock_page_cgroup(page);
 	pc = page_get_page_cgroup(page);
 	if (pc) {
 		mem = get_mem_page_cgroup(pc);
+		biog = get_bio_page_cgroup(pc);
 		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
 			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	}
 	unlock_page_cgroup(page);
-	if (mem) {
+	if (pc) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
-			ctype, mem);
-		put_mem_cgroup(mem);
+			ctype, mem, biog);
+		if (mem)
+			put_mem_cgroup(mem);
+		if (biog)
+			put_bio_cgroup(biog);
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband
  2008-08-12 12:36           ` [PATCH 6/7] bio-cgroup: Implement the bio-cgroup Ryo Tsuruta
@ 2008-08-12 12:37             ` Ryo Tsuruta
  0 siblings, 0 replies; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-12 12:37 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel
  Cc: agk, balbir, xemul, kamezawa.hiroyu, fernando

With this patch, dm-ioband can work with the bio cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c	2008-08-12 14:45:59.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c	2008-08-12 15:07:03.000000000 +0900
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biocontrol.h>
 #include "dm.h"
 #include "dm-bio-list.h"
 #include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-  /*
-   * This function should return the ID of the cgroup which issued "bio".
-   * The ID of the cgroup which the current process belongs to won't be
-   * suitable ID for this purpose, since some BIOs will be handled by kernel
-   * threads like aio or pdflush on behalf of the process requesting the BIOs.
-   */
-	return 0;	/* not implemented yet */
+	struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+	int id = 0;
+	if (ioc) {
+		id = ioc->id;
+		put_io_context(ioc);
+	}
+	return id;
 }
 
 struct group_type dm_ioband_group_type[] = {

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-12 12:35       ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Ryo Tsuruta
  2008-08-12 12:36         ` [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s Ryo Tsuruta
@ 2008-08-18  1:39         ` KAMEZAWA Hiroyuki
  2008-08-19  3:28           ` Ryo Tsuruta
  1 sibling, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-18  1:39 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk, balbir, xemul, fernando

On Tue, 12 Aug 2008 21:35:33 +0900 (JST)
Ryo Tsuruta <ryov@valinux.co.jp> wrote:

> This patch splits the cgroup memory subsystem into two parts.
> One is for tracking pages to find out the owners. The other is
> for controlling how much amount of memory should be assigned to
> each cgroup.
> 
> With this patch, you can use the page tracking mechanism even if
> the memory subsystem is off.
> 

I'm now writing remove-lock-page-cgroup patches. it works well.
please wait for a while...

Thanks,
-Kame


> Based on 2.6.27-rc1-mm1
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> 
> diff -Ndupr linux-2.6.27-rc1-mm1.ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
> --- linux-2.6.27-rc1-mm1.ioband/include/linux/memcontrol.h	2008-08-12 14:30:19.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h	2008-08-12 14:47:11.000000000 +0900
> @@ -20,12 +20,62 @@
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
>  
> +#include <linux/rcupdate.h>
> +#include <linux/mm.h>
> +#include <linux/smp.h>
> +#include <linux/bit_spinlock.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +#ifdef CONFIG_CGROUP_PAGE
> +/*
> + * We use the lower bit of the page->page_cgroup pointer as a bit spin
> + * lock.  We need to ensure that page->page_cgroup is at least two
> + * byte aligned (based on comments from Nick Piggin).  But since
> + * bit_spin_lock doesn't actually set that lock bit in a non-debug
> + * uniprocessor kernel, we should avoid setting it here too.
> + */
> +#define PAGE_CGROUP_LOCK_BIT    0x0
> +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> +#define PAGE_CGROUP_LOCK        (1 << PAGE_CGROUP_LOCK_BIT)
> +#else
> +#define PAGE_CGROUP_LOCK        0x0
> +#endif
> +
> +/*
> + * A page_cgroup page is associated with every page descriptor. The
> + * page_cgroup helps us identify information about the cgroup
> + */
> +struct page_cgroup {
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	struct list_head lru;		/* per cgroup LRU list */
> +	struct mem_cgroup *mem_cgroup;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +	struct page *page;
> +	int flags;
> +};
> +#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
> +#define PAGE_CGROUP_FLAG_ACTIVE	(0x2)	/* page is active in this cgroup */
> +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
> +#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
> +
> +static inline void lock_page_cgroup(struct page *page)
> +{
> +	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static inline int try_lock_page_cgroup(struct page *page)
> +{
> +	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static inline void unlock_page_cgroup(struct page *page)
> +{
> +	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
>  
>  #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
>  
> @@ -34,45 +84,15 @@ extern int mem_cgroup_charge(struct page
>  				gfp_t gfp_mask);
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask);
> -extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
>  extern void mem_cgroup_uncharge_page(struct page *page);
>  extern void mem_cgroup_uncharge_cache_page(struct page *page);
> -extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
> -
> -extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> -					struct list_head *dst,
> -					unsigned long *scanned, int order,
> -					int mode, struct zone *z,
> -					struct mem_cgroup *mem_cont,
> -					int active, int file);
> -extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
> -int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> -
> -extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> -
> -#define mm_match_cgroup(mm, cgroup)	\
> -	((cgroup) == mem_cgroup_from_task((mm)->owner))
>  
>  extern int
>  mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
>  extern void mem_cgroup_end_migration(struct page *page);
> +extern void page_cgroup_init(void);
>  
> -/*
> - * For memory reclaim.
> - */
> -extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
> -extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
> -
> -extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
> -extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> -							int priority);
> -extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> -							int priority);
> -
> -extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> -					int priority, enum lru_list lru);
> -
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
>  static inline void page_reset_bad_cgroup(struct page *page)
>  {
>  }
> @@ -102,6 +122,53 @@ static inline void mem_cgroup_uncharge_c
>  {
>  }
>  
> +static inline int
> +mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +{
> +	return 0;
> +}
> +
> +static inline void mem_cgroup_end_migration(struct page *page)
> +{
> +}
> +#endif /* CONFIG_CGROUP_PAGE */
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +
> +extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> +extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
> +
> +extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> +					struct list_head *dst,
> +					unsigned long *scanned, int order,
> +					int mode, struct zone *z,
> +					struct mem_cgroup *mem_cont,
> +					int active, int file);
> +extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
> +int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> +
> +extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +
> +#define mm_match_cgroup(mm, cgroup)	\
> +	((cgroup) == mem_cgroup_from_task((mm)->owner))
> +
> +/*
> + * For memory reclaim.
> + */
> +extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
> +extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
> +
> +extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
> +extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> +							int priority);
> +extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> +							int priority);
> +
> +extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> +					int priority, enum lru_list lru);
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
>  static inline int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	return 0;
> @@ -122,16 +189,6 @@ static inline int task_in_mem_cgroup(str
>  	return 1;
>  }
>  
> -static inline int
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> -{
> -	return 0;
> -}
> -
> -static inline void mem_cgroup_end_migration(struct page *page)
> -{
> -}
> -
>  static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
>  {
>  	return 0;
> @@ -163,7 +220,8 @@ static inline long mem_cgroup_calc_recla
>  {
>  	return 0;
>  }
> -#endif /* CONFIG_CGROUP_MEM_CONT */
> +
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
>  
> diff -Ndupr linux-2.6.27-rc1-mm1.ioband/include/linux/mm_types.h linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h
> --- linux-2.6.27-rc1-mm1.ioband/include/linux/mm_types.h	2008-08-12 14:30:19.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h	2008-08-12 14:47:11.000000000 +0900
> @@ -92,7 +92,7 @@ struct page {
>  	void *virtual;			/* Kernel virtual address (NULL if
>  					   not kmapped, ie. highmem) */
>  #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
>  	unsigned long page_cgroup;
>  #endif
>  
> diff -Ndupr linux-2.6.27-rc1-mm1.ioband/init/Kconfig linux-2.6.27-rc1-mm1.cg0/init/Kconfig
> --- linux-2.6.27-rc1-mm1.ioband/init/Kconfig	2008-08-12 14:30:19.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/init/Kconfig	2008-08-12 14:47:11.000000000 +0900
> @@ -418,6 +418,10 @@ config CGROUP_MEMRLIMIT_CTLR
>  	  memory RSS and Page Cache control. Virtual address space control
>  	  is provided by this controller.
>  
> +config CGROUP_PAGE
> +	def_bool y
> +	depends on CGROUP_MEM_RES_CTLR
> +
>  config SYSFS_DEPRECATED
>  	bool
>  
> diff -Ndupr linux-2.6.27-rc1-mm1.ioband/mm/Makefile linux-2.6.27-rc1-mm1.cg0/mm/Makefile
> --- linux-2.6.27-rc1-mm1.ioband/mm/Makefile	2008-08-12 14:30:19.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/mm/Makefile	2008-08-12 14:47:11.000000000 +0900
> @@ -34,5 +34,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
>  obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
> diff -Ndupr linux-2.6.27-rc1-mm1.ioband/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c
> --- linux-2.6.27-rc1-mm1.ioband/mm/memcontrol.c	2008-08-12 14:30:19.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c	2008-08-12 14:47:11.000000000 +0900
> @@ -36,10 +36,25 @@
>  
>  #include <asm/uaccess.h>
>  
> -struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> +enum charge_type {
> +	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> +	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> +	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
> +};
> +
> +static void __mem_cgroup_uncharge_common(struct page *, enum charge_type);
> +
>  static struct kmem_cache *page_cgroup_cache __read_mostly;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +struct cgroup_subsys mem_cgroup_subsys __read_mostly;
>  #define MEM_CGROUP_RECLAIM_RETRIES	5
>  
> +static inline int mem_cgroup_disabled(void)
> +{
> +	return mem_cgroup_subsys.disabled;
> +}
> +
>  /*
>   * Statistics for memory cgroup.
>   */
> @@ -136,35 +151,6 @@ struct mem_cgroup {
>  };
>  static struct mem_cgroup init_mem_cgroup;
>  
> -/*
> - * We use the lower bit of the page->page_cgroup pointer as a bit spin
> - * lock.  We need to ensure that page->page_cgroup is at least two
> - * byte aligned (based on comments from Nick Piggin).  But since
> - * bit_spin_lock doesn't actually set that lock bit in a non-debug
> - * uniprocessor kernel, we should avoid setting it here too.
> - */
> -#define PAGE_CGROUP_LOCK_BIT 	0x0
> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
> -#else
> -#define PAGE_CGROUP_LOCK	0x0
> -#endif
> -
> -/*
> - * A page_cgroup page is associated with every page descriptor. The
> - * page_cgroup helps us identify information about the cgroup
> - */
> -struct page_cgroup {
> -	struct list_head lru;		/* per cgroup LRU list */
> -	struct page *page;
> -	struct mem_cgroup *mem_cgroup;
> -	int flags;
> -};
> -#define PAGE_CGROUP_FLAG_CACHE	   (0x1)	/* charged as cache */
> -#define PAGE_CGROUP_FLAG_ACTIVE    (0x2)	/* page is active in this cgroup */
> -#define PAGE_CGROUP_FLAG_FILE	   (0x4)	/* page is file system backed */
> -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
> -
>  static int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -175,12 +161,6 @@ static enum zone_type page_cgroup_zid(st
>  	return page_zonenum(pc->page);
>  }
>  
> -enum charge_type {
> -	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> -	MEM_CGROUP_CHARGE_TYPE_MAPPED,
> -	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
> -};
> -
>  /*
>   * Always modified under lru lock. Then, not necessary to preempt_disable()
>   */
> @@ -248,37 +228,6 @@ struct mem_cgroup *mem_cgroup_from_task(
>  				struct mem_cgroup, css);
>  }
>  
> -static inline int page_cgroup_locked(struct page *page)
> -{
> -	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> -{
> -	VM_BUG_ON(!page_cgroup_locked(page));
> -	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> -}
> -
> -struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> -	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> -}
> -
> -static void lock_page_cgroup(struct page *page)
> -{
> -	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static int try_lock_page_cgroup(struct page *page)
> -{
> -	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void unlock_page_cgroup(struct page *page)
> -{
> -	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
>  static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
>  			struct page_cgroup *pc)
>  {
> @@ -367,7 +316,7 @@ void mem_cgroup_move_lists(struct page *
>  	struct mem_cgroup_per_zone *mz;
>  	unsigned long flags;
>  
> -	if (mem_cgroup_subsys.disabled)
> +	if (mem_cgroup_disabled())
>  		return;
>  
>  	/*
> @@ -506,273 +455,6 @@ unsigned long mem_cgroup_isolate_pages(u
>  }
>  
>  /*
> - * Charge the memory controller for page usage.
> - * Return
> - * 0 if the charge was successful
> - * < 0 if the cgroup is over its limit
> - */
> -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype,
> -				struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *mem;
> -	struct page_cgroup *pc;
> -	unsigned long flags;
> -	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct mem_cgroup_per_zone *mz;
> -
> -	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> -	if (unlikely(pc == NULL))
> -		goto err;
> -
> -	/*
> -	 * We always charge the cgroup the mm_struct belongs to.
> -	 * The mm_struct's mem_cgroup changes on task migration if the
> -	 * thread group leader migrates. It's possible that mm is not
> -	 * set, if so charge the init_mm (happens for pagecache usage).
> -	 */
> -	if (likely(!memcg)) {
> -		rcu_read_lock();
> -		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -		/*
> -		 * For every charge from the cgroup, increment reference count
> -		 */
> -		css_get(&mem->css);
> -		rcu_read_unlock();
> -	} else {
> -		mem = memcg;
> -		css_get(&memcg->css);
> -	}
> -
> -	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> -		if (!(gfp_mask & __GFP_WAIT))
> -			goto out;
> -
> -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> -			continue;
> -
> -		/*
> -		 * try_to_free_mem_cgroup_pages() might not give us a full
> -		 * picture of reclaim. Some pages are reclaimed and might be
> -		 * moved to swap cache or just unmapped from the cgroup.
> -		 * Check the limit again to see if the reclaim reduced the
> -		 * current usage of the cgroup before giving up
> -		 */
> -		if (res_counter_check_under_limit(&mem->res))
> -			continue;
> -
> -		if (!nr_retries--) {
> -			mem_cgroup_out_of_memory(mem, gfp_mask);
> -			goto out;
> -		}
> -	}
> -
> -	pc->mem_cgroup = mem;
> -	pc->page = page;
> -	/*
> -	 * If a page is accounted as a page cache, insert to inactive list.
> -	 * If anon, insert to active list.
> -	 */
> -	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> -		pc->flags = PAGE_CGROUP_FLAG_CACHE;
> -		if (page_is_file_cache(page))
> -			pc->flags |= PAGE_CGROUP_FLAG_FILE;
> -		else
> -			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> -	} else
> -		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> -
> -	lock_page_cgroup(page);
> -	if (unlikely(page_get_page_cgroup(page))) {
> -		unlock_page_cgroup(page);
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> -		css_put(&mem->css);
> -		kmem_cache_free(page_cgroup_cache, pc);
> -		goto done;
> -	}
> -	page_assign_page_cgroup(page, pc);
> -
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_add_list(mz, pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	unlock_page_cgroup(page);
> -done:
> -	return 0;
> -out:
> -	css_put(&mem->css);
> -	kmem_cache_free(page_cgroup_cache, pc);
> -err:
> -	return -ENOMEM;
> -}
> -
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> -{
> -	if (mem_cgroup_subsys.disabled)
> -		return 0;
> -
> -	/*
> -	 * If already mapped, we don't have to account.
> -	 * If page cache, page->mapping has address_space.
> -	 * But page->mapping may have out-of-use anon_vma pointer,
> -	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> -	 * is NULL.
> -  	 */
> -	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> -		return 0;
> -	if (unlikely(!mm))
> -		mm = &init_mm;
> -	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> -}
> -
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> -{
> -	if (mem_cgroup_subsys.disabled)
> -		return 0;
> -
> -	/*
> -	 * Corner case handling. This is called from add_to_page_cache()
> -	 * in usual. But some FS (shmem) precharges this page before calling it
> -	 * and call add_to_page_cache() with GFP_NOWAIT.
> -	 *
> -	 * For GFP_NOWAIT case, the page may be pre-charged before calling
> -	 * add_to_page_cache(). (See shmem.c) check it here and avoid to call
> -	 * charge twice. (It works but has to pay a bit larger cost.)
> -	 */
> -	if (!(gfp_mask & __GFP_WAIT)) {
> -		struct page_cgroup *pc;
> -
> -		lock_page_cgroup(page);
> -		pc = page_get_page_cgroup(page);
> -		if (pc) {
> -			VM_BUG_ON(pc->page != page);
> -			VM_BUG_ON(!pc->mem_cgroup);
> -			unlock_page_cgroup(page);
> -			return 0;
> -		}
> -		unlock_page_cgroup(page);
> -	}
> -
> -	if (unlikely(!mm))
> -		mm = &init_mm;
> -
> -	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> -}
> -
> -/*
> - * uncharge if !page_mapped(page)
> - */
> -static void
> -__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> -{
> -	struct page_cgroup *pc;
> -	struct mem_cgroup *mem;
> -	struct mem_cgroup_per_zone *mz;
> -	unsigned long flags;
> -
> -	if (mem_cgroup_subsys.disabled)
> -		return;
> -
> -	/*
> -	 * Check if our page_cgroup is valid
> -	 */
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (unlikely(!pc))
> -		goto unlock;
> -
> -	VM_BUG_ON(pc->page != page);
> -
> -	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> -	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
> -		|| page_mapped(page)))
> -		goto unlock;
> -
> -	mz = page_cgroup_zoneinfo(pc);
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_remove_list(mz, pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> -	page_assign_page_cgroup(page, NULL);
> -	unlock_page_cgroup(page);
> -
> -	mem = pc->mem_cgroup;
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> -	css_put(&mem->css);
> -
> -	kmem_cache_free(page_cgroup_cache, pc);
> -	return;
> -unlock:
> -	unlock_page_cgroup(page);
> -}
> -
> -void mem_cgroup_uncharge_page(struct page *page)
> -{
> -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> -}
> -
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> -{
> -	VM_BUG_ON(page_mapped(page));
> -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> -}
> -
> -/*
> - * Before starting migration, account against new page.
> - */
> -int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> -{
> -	struct page_cgroup *pc;
> -	struct mem_cgroup *mem = NULL;
> -	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> -	int ret = 0;
> -
> -	if (mem_cgroup_subsys.disabled)
> -		return 0;
> -
> -	lock_page_cgroup(page);
> -	pc = page_get_page_cgroup(page);
> -	if (pc) {
> -		mem = pc->mem_cgroup;
> -		css_get(&mem->css);
> -		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> -			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> -	}
> -	unlock_page_cgroup(page);
> -	if (mem) {
> -		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> -			ctype, mem);
> -		css_put(&mem->css);
> -	}
> -	return ret;
> -}
> -
> -/* remove redundant charge if migration failed*/
> -void mem_cgroup_end_migration(struct page *newpage)
> -{
> -	/*
> -	 * At success, page->mapping is not NULL.
> -	 * special rollback care is necessary when
> -	 * 1. at migration failure. (newpage->mapping is cleared in this case)
> -	 * 2. the newpage was moved but not remapped again because the task
> -	 *    exits and the newpage is obsolete. In this case, the new page
> -	 *    may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> -	 *    always for avoiding mess. The  page_cgroup will be removed if
> -	 *    unnecessary. File cache pages is still on radix-tree. Don't
> -	 *    care it.
> -	 */
> -	if (!newpage->mapping)
> -		__mem_cgroup_uncharge_common(newpage,
> -					 MEM_CGROUP_CHARGE_TYPE_FORCE);
> -	else if (PageAnon(newpage))
> -		mem_cgroup_uncharge_page(newpage);
> -}
> -
> -/*
>   * A call to try to shrink memory usage under specified resource controller.
>   * This is typically used for page reclaiming for shmem for reducing side
>   * effect of page allocation from shmem, which is used by some mem_cgroup.
> @@ -783,7 +465,7 @@ int mem_cgroup_shrink_usage(struct mm_st
>  	int progress = 0;
>  	int retry = MEM_CGROUP_RECLAIM_RETRIES;
>  
> -	if (mem_cgroup_subsys.disabled)
> +	if (mem_cgroup_disabled())
>  		return 0;
>  
>  	rcu_read_lock();
> @@ -1104,7 +786,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  
>  	if (unlikely((cont->parent) == NULL)) {
>  		mem = &init_mem_cgroup;
> -		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> +		page_cgroup_init();
>  	} else {
>  		mem = mem_cgroup_alloc();
>  		if (!mem)
> @@ -1188,3 +870,325 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
>  };
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +static inline int mem_cgroup_disabled(void)
> +{
> +	return 1;
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +static inline int page_cgroup_locked(struct page *page)
> +{
> +	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> +{
> +	VM_BUG_ON(!page_cgroup_locked(page));
> +	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> +}
> +
> +struct page_cgroup *page_get_page_cgroup(struct page *page)
> +{
> +	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> +}
> +
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the cgroup is over its limit
> + */
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> +				gfp_t gfp_mask, enum charge_type ctype,
> +				struct mem_cgroup *memcg)
> +{
> +	struct page_cgroup *pc;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct mem_cgroup_per_zone *mz;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> +	if (unlikely(pc == NULL))
> +		goto err;
> +
> +	/*
> +	 * We always charge the cgroup the mm_struct belongs to.
> +	 * The mm_struct's mem_cgroup changes on task migration if the
> +	 * thread group leader migrates. It's possible that mm is not
> +	 * set, if so charge the init_mm (happens for pagecache usage).
> +	 */
> +	if (likely(!memcg)) {
> +		rcu_read_lock();
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +		/*
> +		 * For every charge from the cgroup, increment reference count
> +		 */
> +		css_get(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +		rcu_read_unlock();
> +	} else {
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		mem = memcg;
> +		css_get(&memcg->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +	}
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> +		if (!(gfp_mask & __GFP_WAIT))
> +			goto out;
> +
> +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> +			continue;
> +
> +		/*
> +		 * try_to_free_mem_cgroup_pages() might not give us a full
> +		 * picture of reclaim. Some pages are reclaimed and might be
> +		 * moved to swap cache or just unmapped from the cgroup.
> +		 * Check the limit again to see if the reclaim reduced the
> +		 * current usage of the cgroup before giving up
> +		 */
> +		if (res_counter_check_under_limit(&mem->res))
> +			continue;
> +
> +		if (!nr_retries--) {
> +			mem_cgroup_out_of_memory(mem, gfp_mask);
> +			goto out;
> +		}
> +	}
> +	pc->mem_cgroup = mem;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	pc->page = page;
> +	/*
> +	 * If a page is accounted as a page cache, insert to inactive list.
> +	 * If anon, insert to active list.
> +	 */
> +	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> +		pc->flags = PAGE_CGROUP_FLAG_CACHE;
> +		if (page_is_file_cache(page))
> +			pc->flags |= PAGE_CGROUP_FLAG_FILE;
> +		else
> +			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> +	} else
> +		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> +
> +	lock_page_cgroup(page);
> +	if (unlikely(page_get_page_cgroup(page))) {
> +		unlock_page_cgroup(page);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +		kmem_cache_free(page_cgroup_cache, pc);
> +		goto done;
> +	}
> +	page_assign_page_cgroup(page, pc);
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	mz = page_cgroup_zoneinfo(pc);
> +	spin_lock_irqsave(&mz->lru_lock, flags);
> +	__mem_cgroup_add_list(mz, pc);
> +	spin_unlock_irqrestore(&mz->lru_lock, flags);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	unlock_page_cgroup(page);
> +done:
> +	return 0;
> +out:
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +	kmem_cache_free(page_cgroup_cache, pc);
> +err:
> +	return -ENOMEM;
> +}
> +
> +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	/*
> +	 * If already mapped, we don't have to account.
> +	 * If page cache, page->mapping has address_space.
> +	 * But page->mapping may have out-of-use anon_vma pointer,
> +	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> +	 * is NULL.
> +	 */
> +	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> +		return 0;
> +	if (unlikely(!mm))
> +		mm = &init_mm;
> +	return mem_cgroup_charge_common(page, mm, gfp_mask,
> +				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +}
> +
> +int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +				gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	/*
> +	 * Corner case handling. This is called from add_to_page_cache()
> +	 * in usual. But some FS (shmem) precharges this page before calling it
> +	 * and call add_to_page_cache() with GFP_NOWAIT.
> +	 *
> +	 * For GFP_NOWAIT case, the page may be pre-charged before calling
> +	 * add_to_page_cache(). (See shmem.c) check it here and avoid to call
> +	 * charge twice. (It works but has to pay a bit larger cost.)
> +	 */
> +	if (!(gfp_mask & __GFP_WAIT)) {
> +		struct page_cgroup *pc;
> +
> +		lock_page_cgroup(page);
> +		pc = page_get_page_cgroup(page);
> +		if (pc) {
> +			VM_BUG_ON(pc->page != page);
> +			VM_BUG_ON(!pc->mem_cgroup);
> +			unlock_page_cgroup(page);
> +			return 0;
> +		}
> +		unlock_page_cgroup(page);
> +	}
> +
> +	if (unlikely(!mm))
> +		mm = &init_mm;
> +
> +	return mem_cgroup_charge_common(page, mm, gfp_mask,
> +				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> +}
> +
> +/*
> + * uncharge if !page_mapped(page)
> + */
> +static void
> +__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> +{
> +	struct page_cgroup *pc;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	struct mem_cgroup *mem;
> +	struct mem_cgroup_per_zone *mz;
> +	unsigned long flags;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	/*
> +	 * Check if our page_cgroup is valid
> +	 */
> +	lock_page_cgroup(page);
> +	pc = page_get_page_cgroup(page);
> +	if (unlikely(!pc))
> +		goto unlock;
> +
> +	VM_BUG_ON(pc->page != page);
> +
> +	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> +	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
> +		|| page_mapped(page)
> +		|| PageSwapCache(page)))
> +		goto unlock;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	mz = page_cgroup_zoneinfo(pc);
> +	spin_lock_irqsave(&mz->lru_lock, flags);
> +	__mem_cgroup_remove_list(mz, pc);
> +	spin_unlock_irqrestore(&mz->lru_lock, flags);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	page_assign_page_cgroup(page, NULL);
> +	unlock_page_cgroup(page);
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	mem = pc->mem_cgroup;
> +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	kmem_cache_free(page_cgroup_cache, pc);
> +	return;
> +unlock:
> +	unlock_page_cgroup(page);
> +}
> +
> +void mem_cgroup_uncharge_page(struct page *page)
> +{
> +	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +}
> +
> +void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> +	VM_BUG_ON(page_mapped(page));
> +	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> +}
> +
> +/*
> + * Before starting migration, account against new page.
> + */
> +int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem = NULL;
> +	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> +	int ret = 0;
> +
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	lock_page_cgroup(page);
> +	pc = page_get_page_cgroup(page);
> +	if (pc) {
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		mem = pc->mem_cgroup;
> +		css_get(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> +			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> +	}
> +	unlock_page_cgroup(page);
> +	if (mem) {
> +		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> +			ctype, mem);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +	}
> +	return ret;
> +}
> +
> +/* remove redundant charge if migration failed*/
> +void mem_cgroup_end_migration(struct page *newpage)
> +{
> +	/*
> +	 * At success, page->mapping is not NULL.
> +	 * special rollback care is necessary when
> +	 * 1. at migration failure. (newpage->mapping is cleared in this case)
> +	 * 2. the newpage was moved but not remapped again because the task
> +	 *    exits and the newpage is obsolete. In this case, the new page
> +	 *    may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> +	 *    always for avoiding mess. The  page_cgroup will be removed if
> +	 *    unnecessary. File cache pages is still on radix-tree. Don't
> +	 *    care it.
> +	 */
> +	if (!newpage->mapping)
> +		__mem_cgroup_uncharge_common(newpage,
> +					 MEM_CGROUP_CHARGE_TYPE_FORCE);
> +	else if (PageAnon(newpage))
> +		mem_cgroup_uncharge_page(newpage);
> +}
> +
> +void page_cgroup_init()
> +{
> +	if (!page_cgroup_cache)
> +		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> +}


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-18  1:39         ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts KAMEZAWA Hiroyuki
@ 2008-08-19  3:28           ` Ryo Tsuruta
  2008-08-19  4:01             ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-19  3:28 UTC (permalink / raw)
  To: kamezawa.hiroyu
  Cc: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk, balbir, xemul, fernando

Hi Kamezawa-san,

> I'm now writing remove-lock-page-cgroup patches. it works well.
> please wait for a while...

I'm looking forward to those patches.

By the way, I'm glad if memory-cgroup has a feature which can make a
page_cgroup move between cgroups with small overhead. It makes
bio-cgroup improve the accuracy of tracking down pages.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-19  3:28           ` Ryo Tsuruta
@ 2008-08-19  4:01             ` Balbir Singh
  2008-08-19 12:46               ` Hirokazu Takahashi
  0 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-08-19  4:01 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: kamezawa.hiroyu, xen-devel, containers, linux-kernel,
	virtualization, dm-devel, agk, xemul, fernando

Ryo Tsuruta wrote:
> Hi Kamezawa-san,
> 
>> I'm now writing remove-lock-page-cgroup patches. it works well.
>> please wait for a while...
> 
> I'm looking forward to those patches.
> 
> By the way, I'm glad if memory-cgroup has a feature which can make a
> page_cgroup move between cgroups with small overhead. It makes
> bio-cgroup improve the accuracy of tracking down pages.

Page movement can be a very expensive operation and is proportional to the size
of the control group. I think movement should be an optional feature, if we ever
add it.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-19  4:01             ` Balbir Singh
@ 2008-08-19 12:46               ` Hirokazu Takahashi
  2008-08-19 13:48                 ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: Hirokazu Takahashi @ 2008-08-19 12:46 UTC (permalink / raw)
  To: balbir
  Cc: ryov, kamezawa.hiroyu, xen-devel, containers, linux-kernel,
	virtualization, dm-devel, agk, xemul, fernando

Hi,

> >> I'm now writing remove-lock-page-cgroup patches. it works well.
> >> please wait for a while...
> > 
> > I'm looking forward to those patches.
> > 
> > By the way, I'm glad if memory-cgroup has a feature which can make a
> > page_cgroup move between cgroups with small overhead. It makes
> > bio-cgroup improve the accuracy of tracking down pages.
> 
> Page movement can be a very expensive operation and is proportional to the size
> of the control group. I think movement should be an optional feature, if we ever
> add it.

Yes, we should avoid moving pages as far as it is balanced fairly well
between groups.

But I want to move pages between bio-cgoups in case it started charging
quite a few I/O requests to a wrong bio-cgroup. I think it will be okay
if pages moves between bio-cgroups, but they don't need to move between
memory cgroups. I know the latter is really heavy and the effect seems
to be so limited.

Thanks,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-19 12:46               ` Hirokazu Takahashi
@ 2008-08-19 13:48                 ` Balbir Singh
  2008-08-20  5:47                   ` Hirokazu Takahashi
  0 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-08-19 13:48 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: ryov, kamezawa.hiroyu, xen-devel, containers, linux-kernel,
	virtualization, dm-devel, agk, xemul, fernando

Hirokazu Takahashi wrote:
> Hi,
> 
>>>> I'm now writing remove-lock-page-cgroup patches. it works well.
>>>> please wait for a while...
>>> I'm looking forward to those patches.
>>>
>>> By the way, I'm glad if memory-cgroup has a feature which can make a
>>> page_cgroup move between cgroups with small overhead. It makes
>>> bio-cgroup improve the accuracy of tracking down pages.
>> Page movement can be a very expensive operation and is proportional to the size
>> of the control group. I think movement should be an optional feature, if we ever
>> add it.
> 
> Yes, we should avoid moving pages as far as it is balanced fairly well
> between groups.
> 
> But I want to move pages between bio-cgoups in case it started charging
> quite a few I/O requests to a wrong bio-cgroup. I think it will be okay
> if pages moves between bio-cgroups, but they don't need to move between
> memory cgroups. I know the latter is really heavy and the effect seems
> to be so limited.

I was under the impression that you wanted memory-cgroup to provide the page
movement feature. That is not on the TODO list for us now.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  2008-08-19 13:48                 ` Balbir Singh
@ 2008-08-20  5:47                   ` Hirokazu Takahashi
  0 siblings, 0 replies; 15+ messages in thread
From: Hirokazu Takahashi @ 2008-08-20  5:47 UTC (permalink / raw)
  To: balbir
  Cc: ryov, kamezawa.hiroyu, xen-devel, containers, linux-kernel,
	virtualization, dm-devel, agk, xemul, fernando

Hi,

> >>>> I'm now writing remove-lock-page-cgroup patches. it works well.
> >>>> please wait for a while...
> >>> I'm looking forward to those patches.
> >>>
> >>> By the way, I'm glad if memory-cgroup has a feature which can make a
> >>> page_cgroup move between cgroups with small overhead. It makes
> >>> bio-cgroup improve the accuracy of tracking down pages.
> >> Page movement can be a very expensive operation and is proportional to the size
> >> of the control group. I think movement should be an optional feature, if we ever
> >> add it.
> > 
> > Yes, we should avoid moving pages as far as it is balanced fairly well
> > between groups.
> > 
> > But I want to move pages between bio-cgoups in case it started charging
> > quite a few I/O requests to a wrong bio-cgroup. I think it will be okay
> > if pages moves between bio-cgroups, but they don't need to move between
> > memory cgroups. I know the latter is really heavy and the effect seems
> > to be so limited.
> 
> I was under the impression that you wanted memory-cgroup to provide the page
> movement feature. That is not on the TODO list for us now.

I think it's practically enough to implement this feature only on
bio-cgroup side and leave the memory controller untouched.
This kind of requests will happen only when making pages dirty.

Of cousrse I'd be happier if you put it on your list to implement
this as a generic feature.

Thanks,
Hirokazu Takahsahi.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples
  2008-08-04  8:52 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
@ 2008-08-04  8:52   ` Ryo Tsuruta
  0 siblings, 0 replies; 15+ messages in thread
From: Ryo Tsuruta @ 2008-08-04  8:52 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel; +Cc: agk

Here is the documentation of design overview, installation, command
reference and examples.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt	2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,937 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same physical device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to its weight, which each job can set its own value to.
+
+     A job is a group of processes with the same pid or pgrp or uid or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the bio-cgroup patch, which can be found at
+   http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+     +------+ +------+ +------+   +------+ +------+ +------+
+     |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+     |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+     +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+     +--V----+---V---+----V---+   +--V----+---V---+----V---+
+     | group | group | default|   | group | group | default|  ioband groups
+     |       |       |  group |   |       |       |  group |
+     +-------+-------+--------+   +-------+-------+--------+
+     |        ioband1         |   |       ioband2          |  ioband devices
+     +-----------|------------+   +-----------|------------+
+     +-----------V--------------+-------------V------------+
+     |                          |                          |
+     |          sdb1            |           sdb2           |  physical devices
+     +--------------------------+--------------------------+
+
+
+   --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+     Dm-ioband is flexible to configure the bandwidth settings.
+
+     Dm-ioband can work with any type of I/O scheduler such as the NOOP
+   scheduler, which is often chosen for high-end storages, since it is
+   implemented outside the I/O scheduling layer. It allows both of partition
+   based bandwidth control and job --- a group of processes --- based
+   control. In addition, it can set different configuration on each physical
+   device to control its bandwidth.
+
+     Meanwhile the current implementation of the CFQ scheduler has 8 IO
+   priority levels and all jobs whose processes have the same IO priority
+   share the bandwidth assigned to this level between them. And IO priority
+   is an attribute of a process so that it equally effects to all block
+   devices.
+
+   --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+     Every ioband device has one ioband group, which by default is called the
+   default group.
+
+     Ioband devices can also have extra ioband groups in them. Each ioband
+   group has a job to support and a weight. Proportional to the weight,
+   dm-ioband gives tokens to the group.
+
+     A group passes on I/O requests that its job issues to the underlying
+   layer so long as it has tokens left, while requests are blocked if there
+   aren't any tokens left in the group. Tokens are refilled once all of
+   groups that have requests on a given physical device use up their tokens.
+
+     There are two policies for token consumption. One is that a token is
+   consumed for each I/O request. The other is that a token is consumed for
+   each I/O sector, for example, one I/O request which consists of
+   4Kbytes(512bytes * 8 sectors) read consumes 8 tokens. A user can choose
+   either policy.
+
+     With this approach, a job running on an ioband group with large weight
+   is guaranteed a wide I/O bandwidth.
+
+   --------------------------------------------------------------------------
+
+Setup and Installation
+
+     Build a kernel with these options enabled:
+
+     CONFIG_MD
+     CONFIG_BLK_DEV_DM
+     CONFIG_DM_IOBAND
+
+
+     If compiled as module, use modprobe to load dm-ioband.
+
+     # make modules
+     # make modules_install
+     # depmod -a
+     # modprobe dm-ioband
+
+
+     "dmsetup targets" command shows all available device-mapper targets.
+   "ioband" is displayed if dm-ioband has been loaded.
+
+     # dmsetup targets
+     ioband           v1.4.0
+
+
+   --------------------------------------------------------------------------
+
+Getting started
+
+     The following is a brief description how to control the I/O bandwidth of
+   disks. In this description, we'll take one disk with two partitions as an
+   example target.
+
+   --------------------------------------------------------------------------
+
+  Create and map ioband devices
+
+     Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+   to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+   and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+   the bandwidth of the physical disk "/dev/sda" while "ioband2" can use 20%.
+
+    # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+        "weight 0 :40" | dmsetup create ioband1
+    # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+        "weight 0 :10" | dmsetup create ioband2
+
+
+     If the commands are successful then the device files
+   "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+   --------------------------------------------------------------------------
+
+  Additional bandwidth control
+
+     In this example two extra ioband groups are created on "ioband1". The
+   first group consists of all the processes with user-id 1000 and the second
+   group consists of all the processes with user-id 2000. Their weights are
+   30 and 20 respectively.
+
+    # dmsetup message ioband1 0 type user
+    # dmsetup message ioband1 0 attach 1000
+    # dmsetup message ioband1 0 attach 2000
+    # dmsetup message ioband1 0 weight 1000:30
+    # dmsetup message ioband1 0 weight 2000:20
+
+
+     Now the processes in the user-id 1000 group can use 30% ---
+   30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
+
+   Table 1. Weight assignments
+
+   +----------------------------------------------------------------+
+   | ioband device |          ioband group          | ioband weight |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 1000                   | 30            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 2000                   | 20            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | default group(the other users) | 40            |
+   |---------------+--------------------------------+---------------|
+   | ioband2       | default group                  | 10            |
+   +----------------------------------------------------------------+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband devices
+
+     Remove the ioband devices when no longer used.
+
+     # dmsetup remove ioband1
+     # dmsetup remove ioband2
+
+
+   --------------------------------------------------------------------------
+
+Command Reference
+
+  Create an ioband device
+
+   SYNOPSIS
+
+           dmsetup create IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Create an ioband device with the given name IOBAND_DEVICE.
+           Generally, dmsetup reads a table from standard input. Each line of
+           the table specifies a single target and is of the form:
+
+             start_sector num_sectors "ioband" device_file ioband_device_id \
+                 io_throttle io_limit ioband_group_type policy token_base \
+                 :weight [ioband_group_id:weight...]
+
+
+                start_sector, num_sectors
+
+                          The sector range of the underlying device where
+                        dm-ioband maps.
+
+                ioband
+
+                          Specify the string "ioband" as a target type.
+
+                device_file
+
+                          Underlying device name.
+
+                ioband_device_id
+
+                          The ID number for an ioband device. The same ID
+                        must be set among the ioband devices that share the
+                        same bandwidth, which means they work on the same
+                        physical disk.
+
+                io_throttle
+
+                          Dm-ioband starts to control the bandwidth when the
+                        number of BIOs in progress exceeds this value. If 0
+                        is specified, dm-ioband uses the default value.
+
+                io_limit
+
+                          Dm-ioband blocks all I/O requests for the
+                        IOBAND_DEVICE when the number of BIOs in progress
+                        exceeds this value. If 0 is specified, dm-ioband uses
+                        the default value.
+
+                ioband_group_type
+
+                          Specify how to evaluate the ioband group ID. The
+                        type must be one of "none", "user", "gid", "pid" or
+                        "pgrp." The type "cgroup" is enabled by applying the
+                        bio-cgroup patch. Specify "none" if you don't need
+                        any ioband groups other than the default ioband
+                        group.
+
+                policy
+
+                          Specify bandwidth control policy. A user can choose
+                        either policy "weight" or "weight-iosize."
+
+                             weight
+
+                                       This policy controls bandwidth
+                                     according to the proportional to the
+                                     weight of each ioband group based on the
+                                     number of I/O requests.
+
+                             weight-iosize
+
+                                       This policy controls bandwidth
+                                     according to the proportional to the
+                                     weight of each ioband group based on the
+                                     number of I/O sectors.
+
+                token_base
+
+                          The number of tokens which specified by token_base
+                        will be distributed to all ioband groups according to
+                        the proportional to the weight of each ioband group.
+                        If 0 is specified, dm-ioband uses the default value.
+
+                ioband_group_id:weight
+
+                          Set the weight of the ioband group specified by
+                        ioband_group_id. If ioband_group_id is omitted, the
+                        weight is assigned to the default ioband group.
+
+   EXAMPLE
+
+             Create an ioband device with the following parameters:
+
+              *   Starting sector = "0"
+
+              *   The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+              *   Target type = "ioband"
+
+              *   Underlying device name = "/dev/sda1"
+
+              *   Ioband device ID = "128"
+
+              *   I/O throttle = "10"
+
+              *   I/O limit = "400"
+
+              *   Ioband group type = "user"
+
+              *   Bandwidth control policy = "weight"
+
+              *   Token base = "2048"
+
+              *   Weight for the default ioband group = "100"
+
+              *   Weight for the ioband group 1000 = "80"
+
+              *   Weight for the ioband group 2000 = "20"
+
+              *   Ioband device name = "ioband1"
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband" \
+               "/dev/sda1 128 10 400 user weight 2048 :100 1000:80 2000:20" \
+               | dmsetup create ioband1
+
+
+             Create two device groups (ID=1,2). The bandwidths of these
+           device groups will be individually controlled.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+               "0 0 none weight 0 :80" | dmsetup create ioband1
+             # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+               "0 0 none weight 0 :20" | dmsetup create ioband2
+             # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+               "0 0 none weight 0 :60" | dmsetup create ioband3
+             # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+               "0 0 none weight 0 :40" | dmsetup create ioband4
+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband device
+
+   SYNOPSIS
+
+           dmsetup remove IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Remove the specified ioband device IOBAND_DEVICE. All the band
+           groups attached to the ioband device are also removed
+           automatically.
+
+   EXAMPLE
+
+             Remove ioband device "ioband1."
+
+             # dmsetup remove ioband1
+
+
+   --------------------------------------------------------------------------
+
+  Set an ioband group type
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 type TYPE
+
+   DESCRIPTION
+
+             Set the ioband group type of the specified ioband device
+           IOBAND_DEVICE. TYPE must be one of "none", "user", "gid", "pid" or
+           "pgrp." The type "cgroup" is enabled by applying the bio-cgroup
+           patch. Once the type is set, new ioband groups can be created on
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the ioband group type of ioband device "ioband1" to "user."
+
+             # dmsetup message ioband1 0 type user
+
+
+   --------------------------------------------------------------------------
+
+  Create an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 attach ID
+
+   DESCRIPTION
+
+             Create an ioband group and attach it to IOBAND_DEVICE. ID
+           specifies user-id, group-id, process-id or process-group-id
+           depending the ioband group type of IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Create an ioband group which consists of all processes with
+           user-id 1000 and attach it to ioband device "ioband1."
+
+             # dmsetup message ioband1 0 type user
+             # dmsetup message ioband1 0 attach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Detach the ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 detach ID
+
+   DESCRIPTION
+
+             Detach the ioband group specified by ID from ioband device
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Detach the ioband group with ID "2000" from ioband device
+           "ioband2."
+
+             # dmsetup message ioband2 0 detach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Set bandwidth control policy
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 policy policy
+
+   DESCRIPTION
+
+             Set bandwidth control policy. This command applies to all ioband
+           devices which have the same ioband device ID as IOBAND_DEVICE. A
+           user can choose either policy "weight" or "weight-iosize."
+
+                weight
+
+                          This policy controls bandwidth according to the
+                        proportional to the weight of each ioband group based
+                        on the number of I/O requests.
+
+                weight-iosize
+
+                          This policy controls bandwidth according to the
+                        proportional to the weight of each ioband group based
+                        on the number of I/O sectors.
+
+   EXAMPLE
+
+             Set bandwidth control policy of ioband devices which have the
+           same ioband device ID as "ioband1" to "weight-iosize."
+
+             # dmsetup message ioband1 0 policy weight-iosize
+
+
+   --------------------------------------------------------------------------
+
+  Set the weight of an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 weight VAL
+
+           dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+   DESCRIPTION
+
+             Set the weight of the ioband group specified by ID. Set the
+           weight of the default ioband group of IOBAND_DEVICE if ID isn't
+           specified.
+
+             The following example means that "ioband1" can use 80% ---
+           40/(40+10)*100 --- of the bandwidth of the physical disk while
+           "ioband2" can use 20%.
+
+             # dmsetup message ioband1 0 weight 40
+             # dmsetup message ioband2 0 weight 10
+
+
+             The following lines have the same effect as the above:
+
+             # dmsetup message ioband1 0 weight 4
+             # dmsetup message ioband2 0 weight 1
+
+
+             VAL must be an integer larger than 0. The default value, which
+           is assigned to newly created ioband groups, is 100.
+
+   EXAMPLE
+
+             Set the weight of the default ioband group of "ioband1" to 40.
+
+             # dmsetup message ioband1 0 weight 40
+
+
+             Set the weight of the ioband group of "ioband1" with ID "1000"
+           to 10.
+
+             # dmsetup message ioband1 0 weight 1000:10
+
+
+   --------------------------------------------------------------------------
+
+  Set the number of tokens
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 token VAL
+
+   DESCRIPTION
+
+             Set the number of tokens to VAL. According to their weight, this
+           number of tokens will be distributed to all the ioband groups on
+           the physical device to which ioband device IOBAND_DEVICE belongs
+           when they use up their tokens.
+
+             VAL must be an integer greater than 0. The default is 2048.
+
+   EXAMPLE
+
+             Set the number of tokens of the physical device to which
+           "ioband1" belongs to 256.
+
+             # dmsetup message ioband1 0 token 256
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O throttling
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+   DESCRIPTION
+
+             Set the I/O throttling value of the physical disk to which
+           ioband device IOBAND_DEVICE belongs to VAL. Dm-ioband start to
+           control the bandwidth when the number of BIOs in progress on the
+           physical disk exceeds this value.
+
+   EXAMPLE
+
+             Set the I/O throttling value of "ioband1" to 16.
+
+             # dmsetup message ioband1 0 io_throttle 16
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O limiting
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+   DESCRIPTION
+
+             Set the I/O limiting value of the physical disk to which ioband
+           device IOBAND_DEVICE belongs to VAL. Dm-ioband will block all I/O
+           requests for the physical device if the number of BIOs in progress
+           on the physical disk exceeds this value.
+
+   EXAMPLE
+
+             Set the I/O limiting value of "ioband1" to 128.
+
+             # dmsetup message ioband1 0 io_limit 128
+
+
+   --------------------------------------------------------------------------
+
+  Display settings
+
+   SYNOPSIS
+
+           dmsetup table --target ioband
+
+   DESCRIPTION
+
+             Display the current table for the ioband device in a format. See
+           "dmsetup create" command for information on the table format.
+
+   EXAMPLE
+
+             The following output shows the current table of "ioband1."
+
+             # dmsetup table --target ioband
+             ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \
+               2048 :100 1000:80 2000:20
+
+
+   --------------------------------------------------------------------------
+
+  Display Statistics
+
+   SYNOPSIS
+
+           dmsetup status --target ioband
+
+   DESCRIPTION
+
+             Display the statistics of all the ioband devices whose target
+           type is "ioband."
+
+             The output format is as below. the first five columns shows:
+
+              *   ioband device name
+
+              *   logical start sector of the device (must be 0)
+
+              *   device size in sectors
+
+              *   target type (must be "ioband")
+
+              *   device group ID
+
+             The remaining columns show the statistics of each ioband group
+           on the band device. Each group uses seven columns for its
+           statistics.
+
+              *   ioband group ID (-1 means default)
+
+              *   total read requests
+
+              *   delayed read requests
+
+              *   total read sectors
+
+              *   total write requests
+
+              *   delayed write requests
+
+              *   total write sectors
+
+   EXAMPLE
+
+             The following output shows the statistics of two ioband devices.
+           Ioband2 only has the default ioband group and ioband1 has three
+           (default, 1001, 1002) ioband groups.
+
+             # dmsetup status
+             ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+             ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+             166 107 472 139 95 352 1002 211 146 520 210 147 504
+
+
+   --------------------------------------------------------------------------
+
+  Reset status counter
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 reset
+
+   DESCRIPTION
+
+             Reset the statistics of ioband device IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Reset the statistics of "ioband1."
+
+             # dmsetup message ioband1 0 reset
+
+
+   --------------------------------------------------------------------------
+
+Examples
+
+  Example #1: Bandwidth control on Partitions
+
+     This example describes how to control the bandwidth with disk
+   partitions. The following diagram illustrates the configuration of this
+   example. You may want to run a database on /dev/mapper/ioband1 and web
+   applications on /dev/mapper/ioband2.
+
+                 /mnt1                        /mnt2            mount points
+                   |                              |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical devices
+     +---------------------------+---------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Create ioband devices with the same device group ID and assign
+       weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+             "none weight 0 :80" | dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+             "none weight 0 :40" | dmsetup create ioband2
+
+
+    2.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #2: Bandwidth control on Logical Volumes
+
+     This example is similar to the example #1 but it uses LVM logical
+   volumes instead of disk partitions. This example shows how to configure
+   ioband devices on two striped logical volumes.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |      /dev/mapper/lv0     | |     /dev/mapper/lv1      | striped logical
+     |                          | |                          | volumes
+     +-------------------------------------------------------+
+     |                          vg0                          | volume group
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/sdb         | |         /dev/sdc         | physical devices
+     +--------------------------+ +--------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Initialize the partitions for use by LVM.
+
+         # pvcreate /dev/sdb
+         # pvcreate /dev/sdc
+
+
+    2.   Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+         # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+    3.   Create two logical volumes in "vg0." The volumes have to be striped.
+
+         # lvcreate -n lv0 -i 2 -l 64 vg0 -L 1024M
+         # lvcreate -n lv1 -i 2 -l 64 vg0 -L 1024M
+
+
+         The rest is the same as the example #1.
+
+    4.   Create ioband devices corresponding to each logical volume and
+       assign weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+            "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+            dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+            "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+            dmsetup create ioband2
+
+
+    5.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #3: Bandwidth control on processes
+
+     This example describes how to control the bandwidth with groups of
+   processes. You may also want to run an additional application on the same
+   machine described in the example #1. This example shows how to add a new
+   ioband group for this application.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +-------------+------------+ +-------------+------------+
+     |          default         | |  user=1000  |   default  | ioband groups
+     |           (80)           | |     (20)    |    (40)    |   (weight)
+     +-------------+------------+ +-------------+------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical device
+     +---------------------------+---------------------------+
+
+
+     The following shows to set up a new ioband group on the machine that is
+   already configured as the example #1. The application will have a weight
+   of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+    1.   Set the type of ioband2 to "user."
+
+         # dmsetup message ioband2 0 type user.
+
+
+    2.   Create a new ioband group on ioband2.
+
+         # dmsetup message ioband2 0 attach 1000
+
+
+    3.   Assign weight of 10 to this newly created ioband group.
+
+         # dmsetup message ioband2 0 weight 1000:20
+
+
+   --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices. The following diagram illustrates the configuration of this
+   example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | physical device
+     +---------------------------+---------------------------+
+
+
+     The followings shows how to map ioband device "ioband1" and "ioband2" to
+   virtual block device "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on
+   Virtual Machine 2" respectively on the machine configured as the example
+   #1. Add the following lines to the configuration files that are referenced
+   when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+   --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices when Xen blktap devices are used. The following diagram
+   illustrates the configuration of this example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V----------------------------V------------+
+     |                  /dev/mapper/ioband1                  | ioband device
+     +---------------------------+---------------------------+
+     |       default group       |        default group      | ioband groups
+     |           (80)            |            (40)           |    (weight)
+     +-------------|-------------+--------------|------------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |  +----------V----------+      +----------V---------+  |
+     |  |       vm1.img       |      |       vm2.img      |  | disk image files
+     |  +---------------------+      +--------------------+  |
+     |                        /vmdisk                        | mount point
+     +---------------------------|---------------------------+
+                                 |
+     +---------------------------V---------------------------+
+     |                       /dev/sda1                       | physical device
+     +-------------------------------------------------------+
+
+
+     To setup the above configuration, follow these steps:
+
+    1.   Create an ioband device.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+             "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+    2.   Add the following lines to the configuration files that are
+       referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+       Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+    3.   Run the virtual machines.
+
+         # xm create vm1
+         # xm create vm2
+
+
+    4.   Find out the process IDs of the daemons which control the blktap
+       devices.
+
+         # lsof /vmdisk/disk[12].img
+         COMMAND   PID USER   FD   TYPE DEVICE       SIZE  NODE NAME
+         tapdisk 15011 root   11u   REG  253,0 2147483648 48961 /vmdisk/vm1.img
+         tapdisk 15276 root   13u   REG  253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+    5.   Create new ioband groups of pid 15011 and pid 15276, which are
+       process IDs of the tapdisks, and assign weight of 80 and 40 to the
+       groups respectively.
+
+         # dmsetup message ioband1 0 type pid
+         # dmsetup message ioband1 0 attach 15011
+         # dmsetup message ioband1 0 weight 15011:80
+         # dmsetup message ioband1 0 attach 15276
+         # dmsetup message ioband1 0 weight 15276:40

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-08-20  5:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-12 12:31 [PATCH 0/7] I/O bandwidth controller and BIO tracking Ryo Tsuruta
2008-08-12 12:33 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
2008-08-12 12:34   ` [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples Ryo Tsuruta
2008-08-12 12:34     ` [PATCH 3/7] bio-cgroup: Introduction Ryo Tsuruta
2008-08-12 12:35       ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Ryo Tsuruta
2008-08-12 12:36         ` [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s Ryo Tsuruta
2008-08-12 12:36           ` [PATCH 6/7] bio-cgroup: Implement the bio-cgroup Ryo Tsuruta
2008-08-12 12:37             ` [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband Ryo Tsuruta
2008-08-18  1:39         ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts KAMEZAWA Hiroyuki
2008-08-19  3:28           ` Ryo Tsuruta
2008-08-19  4:01             ` Balbir Singh
2008-08-19 12:46               ` Hirokazu Takahashi
2008-08-19 13:48                 ` Balbir Singh
2008-08-20  5:47                   ` Hirokazu Takahashi
  -- strict thread matches above, loose matches on Subject: below --
2008-08-04  8:51 [PATCH 0/7] I/O bandwidth controller and BIO tracking Ryo Tsuruta
2008-08-04  8:52 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
2008-08-04  8:52   ` [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples Ryo Tsuruta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).