* [RFC][PATCH 00/11] blkiocg async support
@ 2010-07-09  2:57 Munehiro Ikeda
  2010-07-09  3:14 ` [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller Munehiro Ikeda
                   ` (14 more replies)
  0 siblings, 15 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  2:57 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi, Gui Jianfeng,
	akpm, balbir, Muuhh Ikeda

These RFC patches are a trial to add async (cached) write support to the
blkio controller.

The only testing done so far is to compile, boot, and confirm that write
bandwidth appears to be prioritized when pages dirtied by two processes in
different cgroups are written back to a device simultaneously.  I know this
is the bare minimum (or less) of testing, but I am posting this as an RFC
because I would like to hear your opinions about the design direction at
this early stage.

Patches are for 2.6.35-rc4.

This patch series consists of two chunks.

(1) iotrack (patch 01/11 -- 06/11)

This is functionality to track who dirtied a page, or more precisely, which
cgroup the process that dirtied the page belongs to.  The blkio controller
reads this info later and uses it to prioritize the IO when the page is
actually written back to a block device.
This work originates from Ryo Tsuruta and Hirokazu Takahashi and includes
Andrea Righi's idea.  It was originally posted as part of dm-ioband, which
was one of the proposals for an IO controller.
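
As an illustration of the intended flow (this is not an actual kernel
function; it just strings together the interfaces introduced in patches
02 and 03, and the helper name here is hypothetical):

	#include <linux/mm.h>
	#include <linux/sched.h>
	#include <linux/blkdev.h>
	#include <linux/blk-iotrack.h>	/* introduced in patch 02/11 */

	/* illustration only: record the dirtier, then look it up at IO time */
	static unsigned long who_dirtied(struct page *page, struct bio *bio)
	{
		/* page-dirtying paths (hooked in patch 03/11) record the
		 * blkio cgroup css ID of the dirtier in page_cgroup */
		blk_iotrack_set_owner(page, current->mm);

		/* when the flush thread later submits the IO, CFQ asks
		 * iotrack which cgroup dirtied the bio's first page */
		return blk_iotrack_cgroup_id(bio);
	}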


(2) blkio controller modification (07/11 -- 11/11)

This is the main part of blkio controller async write support.
Currently async queues are device-wide and async write IOs are always
treated as belonging to the root group.  These patches make async queues
per-cfq_group per-device so they can be controlled individually.
Async writes are handled by the flush kernel thread.  Because queue pointers
are stored in cfq_io_context, the thread's io_context has to hold multiple
cfq_io_contexts per device.  So these patches make cfq_io_context
per-io_context per-cfq_group, which means per-io_context per-cgroup
per-device.
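
Structurally (see patch 10/11 for the real code), the change boils down
to moving the async queues and the cfq_io_context list from the
per-device cfq_data into the per-cgroup-per-device cfq_group.  The
sketch below only excerpts the relevant members; the _before/_after
struct names are illustrative, not real identifiers:

	#include <linux/list.h>
	#include <linux/ioprio.h>

	struct cfq_queue;	/* other members omitted for brevity */

	/* before this series: one set of async queues per device */
	struct cfq_data_before {
		struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
		struct cfq_queue *async_idle_cfqq;
	};

	/* after patch 10/11: async queues, plus the cic_list linking
	 * cfq_io_contexts, live in cfq_group (per cgroup per device) */
	struct cfq_group_after {
		struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
		struct cfq_queue *async_idle_cfqq;
		unsigned int cic_index;
		struct list_head cic_list;
	};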


This might be one piece of the puzzle for complete async write support in
the blkio controller.  Another piece in my head is page dirtying ratio
control.  I believe Andrea Righi was working on it... what is the current
status?

I also think that async write support is required by the bandwidth capping
policy of the blkio controller.  Bandwidth capping can be done in a layer
above the elevator, but in my opinion it should also be done in the
elevator layer.  The elevator buffers and sorts requests; if there is
another buffering layer above it, the buffering is doubled and that can be
harmful to the elevator's prediction.

I appreciate any comments and suggestions.


Thanks,
Muuhh


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com



* [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
@ 2010-07-09  3:14 ` Munehiro Ikeda
  2010-07-26  6:49   ` Balbir Singh
  2010-07-09  3:15 ` [RFC][PATCH 02/11] blkiocg async: The main part of iotrack Munehiro Ikeda
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:14 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

This patch makes page_cgroup independent of the memory controller
so that kernel functionality other than the memory controller can
use page_cgroup.

This patch is based on a patch posted by Ryo Tsuruta on Oct 2,
2009, titled "The new page_cgroup framework".

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 include/linux/mmzone.h      |    4 ++--
 include/linux/page_cgroup.h |    4 ++--
 init/Kconfig                |    4 ++++
 mm/Makefile                 |    3 ++-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..d3a9bf7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -612,7 +612,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -971,7 +971,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bb13b3..6a21b0d 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -104,7 +104,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index 5cff9a9..2e40f2f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -648,6 +648,10 @@ config DEBUG_BLK_CGROUP
 
 endif # CGROUPS
 
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 8982504..57b112e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,7 +40,8 @@ else
 obj-y += percpu_up.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
-- 
1.6.2.5


* [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
  2010-07-09  3:14 ` [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller Munehiro Ikeda
@ 2010-07-09  3:15 ` Munehiro Ikeda
  2010-07-09  7:35   ` KAMEZAWA Hiroyuki
  2010-07-09  7:38   ` KAMEZAWA Hiroyuki
  2010-07-09  3:16 ` [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack Munehiro Ikeda
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:15 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

iotrack is functionality to record who dirtied a page.  This
is needed for the block IO controller cgroup to support async
(cached) writes.

This patch is based on a patch posted by Ryo Tsuruta on
Oct 2, 2009, titled "The body of blkio-cgroup".  That patch
added a new member to struct page_cgroup to record the cgroup
ID, but Kame, a maintainer of the memory controller cgroup,
objected because it bloats the size of struct page_cgroup.

Instead, this patch takes the approach proposed by Andrea
Righi, which bit-encodes the cgroup ID into the flags field
of struct page_cgroup.
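
As a concrete illustration of the encoding (the real helpers are
page_cgroup_get_id()/page_cgroup_set_id() introduced below; this
fragment is only an example, not part of the patch):

	/*
	 * page_cgroup->flags layout, assuming a 64-bit unsigned long:
	 *   bits 15..0  : existing PCG_* flag bits
	 *   bits 63..16 : blkio cgroup css ID
	 */
	static unsigned long encode_example(unsigned long flags,
						unsigned long id)
	{
		flags &= (1UL << 16) - 1;	/* keep the flag bits    */
		flags |= id << 16;		/* store the cgroup ID   */
		return flags;			/* id == (flags >> 16)   */
	}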

ToDo:
The cgroup ID of a deleted cgroup can be recycled.  Further
consideration is needed.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/Kconfig.iosched       |    8 +++
 block/Makefile              |    1 +
 block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-iotrack.h |   66 +++++++++++++++++++++
 include/linux/page_cgroup.h |   25 ++++++++
 init/Kconfig                |    2 +-
 mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
 7 files changed, 313 insertions(+), 9 deletions(-)
 create mode 100644 block/blk-iotrack.c
 create mode 100644 include/linux/blk-iotrack.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3199b76..3ab712d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config GROUP_IOSCHED_ASYNC
+	bool "CFQ Group Scheduling for async IOs (EXPERIMENTAL)"
+	depends on CFQ_GROUP_IOSCHED && EXPERIMENTAL
+	select MM_OWNER
+	default n
+	help
+	  Enable group IO scheduling for async IOs.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/Makefile b/block/Makefile
index 0bb499a..441858d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
+obj-$(CONFIG_GROUP_IOSCHED_ASYNC) += blk-iotrack.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
diff --git a/block/blk-iotrack.c b/block/blk-iotrack.c
new file mode 100644
index 0000000..d98a09a
--- /dev/null
+++ b/block/blk-iotrack.c
@@ -0,0 +1,129 @@
+/* blk-iotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2010 Munehiro Ikeda <m-ikeda@ds.jp.nec.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/rcupdate.h>
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/blk-iotrack.h>
+#include "blk-cgroup.h"
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *p)
+{
+	return cgroup_to_blkio_cgroup(task_cgroup(p, blkio_subsys_id));
+}
+
+/**
+ * blk_iotrack_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *blkcg;
+	unsigned short id = 0;	/* 0: default blkio_cgroup id */
+
+	if (blk_iotrack_disabled())
+		return 0;
+	if (!mm)
+		goto out;
+
+	rcu_read_lock();
+	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+	if (likely(blkcg))
+		id = css_id(&blkcg->css);
+	rcu_read_unlock();
+out:
+	return page_cgroup_set_owner(page, id);
+}
+
+/**
+ * blk_iotrack_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+int blk_iotrack_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	return blk_iotrack_set_owner(page, mm);
+}
+
+/**
+ * blk_iotrack_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+int blk_iotrack_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return 0;
+	if (current->flags & PF_MEMALLOC)
+		return 0;
+
+	return blk_iotrack_reset_owner(page, mm);
+}
+
+/**
+ * blk_iotrack_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+int blk_iotrack_copy_owner(struct page *npage, struct page *opage)
+{
+	if (blk_iotrack_disabled())
+		return 0;
+	return page_cgroup_copy_owner(npage, opage);
+}
+
+/**
+ * blk_iotrack_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio.  A return value of zero
+ * means that the page associated with the bio belongs to root blkio_cgroup.
+ */
+unsigned long blk_iotrack_cgroup_id(struct bio *bio)
+{
+	struct page *page;
+
+	if (!bio->bi_vcnt)
+		return 0;
+
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return page_cgroup_get_owner(page);
+}
+EXPORT_SYMBOL(blk_iotrack_cgroup_id);
+
diff --git a/include/linux/blk-iotrack.h b/include/linux/blk-iotrack.h
new file mode 100644
index 0000000..8021c2b
--- /dev/null
+++ b/include/linux/blk-iotrack.h
@@ -0,0 +1,66 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BLK_IOTRACK_H
+#define _LINUX_BLK_IOTRACK_H
+
+#ifdef CONFIG_GROUP_IOSCHED_ASYNC
+
+/**
+ * blk_iotrack_disabled() - check whether block IO tracking is disabled
+ * Returns true if disabled, false if not.
+ */
+static inline bool blk_iotrack_disabled(void)
+{
+	if (blkio_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm);
+extern int blk_iotrack_reset_owner(struct page *page, struct mm_struct *mm);
+extern int blk_iotrack_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern int blk_iotrack_copy_owner(struct page *page, struct page *opage);
+extern unsigned long blk_iotrack_cgroup_id(struct bio *bio);
+
+#else /* !CONFIG_GROUP_IOSCHED_ASYNC */
+
+static inline bool blk_iotrack_disabled(void)
+{
+	return true;
+}
+
+static inline int blk_iotrack_set_owner(struct page *page,
+						struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline int blk_iotrack_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline int blk_iotrack_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline int blk_iotrack_copy_owner(struct page *page,
+						struct page *opage)
+{
+	return 0;
+}
+
+static inline unsigned long blk_iotrack_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED_ASYNC */
+
+#endif /* _LINUX_BLK_IOTRACK_H */
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 6a21b0d..473b79a 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -17,6 +17,31 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 };
 
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PAGE_TRACKING_ID_SHIFT	(16)
+#define PAGE_TRACKING_ID_BITS \
+		(8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with page_cgroup() lock held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PAGE_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with page_cgroup() lock held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
+}
+
+unsigned long page_cgroup_get_owner(struct page *page);
+int page_cgroup_set_owner(struct page *page, unsigned long id);
+int page_cgroup_copy_owner(struct page *npage, struct page *opage);
+
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
 
 #ifdef CONFIG_SPARSEMEM
diff --git a/init/Kconfig b/init/Kconfig
index 2e40f2f..337ee01 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -650,7 +650,7 @@ endif # CGROUPS
 
 config CGROUP_PAGE
 	def_bool y
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR || GROUP_IOSCHED_ASYNC
 
 config MM_OWNER
 	bool
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 6c00814..69e080c 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/blk-iotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blk_iotrack_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -251,7 +253,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blk_iotrack_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -260,14 +262,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -277,6 +280,78 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
 
 #endif
 
+/**
+ * page_cgroup_get_owner() - get the owner ID of a page
+ * @page:	the page we want to find the owner
+ *
+ * Returns the owner ID of the page; 0 means that the owner cannot be
+ * retrieved.
+ **/
+unsigned long page_cgroup_get_owner(struct page *page)
+{
+	struct page_cgroup *pc;
+	unsigned long ret;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return 0;
+
+	lock_page_cgroup(pc);
+	ret = page_cgroup_get_id(pc);
+	unlock_page_cgroup(pc);
+	return ret;
+}
+
+/**
+ * page_cgroup_set_owner() - set the owner ID of a page
+ * @page:	the page we want to tag
+ * @id:		the ID number that will be associated to page
+ *
+ * Returns 0 if the owner is correctly associated to the page. Returns a
+ * negative value in case of failure.
+ **/
+int page_cgroup_set_owner(struct page *page, unsigned long id)
+{
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return -ENOENT;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+	return 0;
+}
+
+/**
+ * page_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Returns 0 if the owner is correctly associated to npage. Returns a negative
+ * value in case of failure.
+ **/
+int page_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return -ENOENT;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return -ENOENT;
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+
+	return 0;
+}
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 
-- 
1.6.2.5


* [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
  2010-07-09  3:14 ` [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller Munehiro Ikeda
  2010-07-09  3:15 ` [RFC][PATCH 02/11] blkiocg async: The main part of iotrack Munehiro Ikeda
@ 2010-07-09  3:16 ` Munehiro Ikeda
  2010-07-09  9:24   ` Andrea Righi
  2010-07-09  3:16 ` [RFC][PATCH 04/11] blkiocg async: block_commit_write not to record process info Munehiro Ikeda
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:16 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

Embed hooks for iotrack to record process info, namely the cgroup ID.
This patch is based on a patch posted by Ryo Tsuruta on Oct 2, 2009,
titled "Page tracking hooks".

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 fs/buffer.c         |    2 ++
 fs/direct-io.c      |    2 ++
 mm/bounce.c         |    2 ++
 mm/filemap.c        |    2 ++
 mm/memory.c         |    5 +++++
 mm/page-writeback.c |    2 ++
 mm/swap_state.c     |    2 ++
 7 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d54812b..c418fdf 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/blk-iotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -667,6 +668,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blk_iotrack_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7600aac..2c1f42f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/blk-iotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -846,6 +847,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blk_iotrack_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/mm/bounce.c b/mm/bounce.c
index 13b6dad..04339df 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/blk-iotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -211,6 +212,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blk_iotrack_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 20e5642..a255d0c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/blk-iotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -405,6 +406,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blk_iotrack_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memory.c b/mm/memory.c
index 119b7cc..3eb2d0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -52,6 +52,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/blk-iotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2283,6 +2284,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blk_iotrack_set_owner(new_page, mm);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2718,6 +2720,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blk_iotrack_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2795,6 +2798,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
+	blk_iotrack_set_owner(page, mm);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2949,6 +2953,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter_fast(mm, MM_ANONPAGES);
 			page_add_new_anon_rmap(page, vma, address);
+			blk_iotrack_set_owner(page, mm);
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 54f28bd..f3e6b2c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/blk-iotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1128,6 +1129,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blk_iotrack_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e10f583..ab26978 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/blk-iotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -324,6 +325,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blk_iotrack_set_owner(new_page, current->mm);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
-- 
1.6.2.5


* [RFC][PATCH 04/11] blkiocg async: block_commit_write not to record process info
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (2 preceding siblings ...)
  2010-07-09  3:16 ` [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack Munehiro Ikeda
@ 2010-07-09  3:16 ` Munehiro Ikeda
  2010-07-09  3:17 ` [RFC][PATCH 05/11] blkiocg async: __set_page_dirty_nobuffer " Munehiro Ikeda
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:16 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

When an mmap(2)'d page, which has no buffer_heads, is written back,
ext4 prepares buffer_heads and calls block_commit_write() from
ext4_writepage().  This results in a call to mark_buffer_dirty(), which
sets the page's dirty flag.  In this case the process marking the page
dirty is (almost always) the flush kernel thread, so the original info
about the process which dirtied this page is lost.

To prevent this, this patch introduces block_commit_write_noiotrack(),
which is the same as block_commit_write() but runs through a code path
that does not record the current process info.

The portion of ext4 that calls block_commit_write() will be modified in
a later patch (06/11).
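
To make the problem concrete, the call chain that loses the owner info
is roughly the following (simplified; as I read the ext4 writeback
path):

	/* flush kernel thread context:
	 *   ext4_writepage()
	 *     block_commit_write()          <- switched in patch 06/11
	 *       __block_commit_write()
	 *         mark_buffer_dirty(bh)
	 *           __set_page_dirty()      <- records current->mm, i.e. the
	 *                                      flush thread, not the task
	 *                                      that dirtied the page
	 */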

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 fs/buffer.c                 |   70 ++++++++++++++++++++++++++++++++-----------
 include/linux/buffer_head.h |    2 +
 2 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c418fdf..61ebf94 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -660,15 +660,17 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  *
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
+ * If track is true, dirtying process info is recorded for iotrack.
  */
 static void __set_page_dirty(struct page *page,
-		struct address_space *mapping, int warn)
+		struct address_space *mapping, int warn, int track)
 {
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
-		blk_iotrack_reset_owner_pagedirty(page, current->mm);
+		if (track)
+			blk_iotrack_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
@@ -723,7 +725,7 @@ int __set_page_dirty_buffers(struct page *page)
 	spin_unlock(&mapping->private_lock);
 
 	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
+		__set_page_dirty(page, mapping, 1, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1137,18 +1139,11 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
  */
 
 /**
- * mark_buffer_dirty - mark a buffer_head as needing writeout
+ * __mark_buffer_dirty - helper function for mark_buffer_dirty*
  * @bh: the buffer_head to mark dirty
- *
- * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
- * backing page dirty, then tag the page as dirty in its address_space's radix
- * tree and then attach the address_space's inode to its superblock's dirty
- * inode list.
- *
- * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * @track: if true, dirtying process info will be recorded for iotrack
  */
-void mark_buffer_dirty(struct buffer_head *bh)
+static void __mark_buffer_dirty(struct buffer_head *bh, int track)
 {
 	WARN_ON_ONCE(!buffer_uptodate(bh));
 
@@ -1169,12 +1164,40 @@ void mark_buffer_dirty(struct buffer_head *bh)
 		if (!TestSetPageDirty(page)) {
 			struct address_space *mapping = page_mapping(page);
 			if (mapping)
-				__set_page_dirty(page, mapping, 0);
+				__set_page_dirty(page, mapping, 0, track);
 		}
 	}
 }
+
+/**
+ * mark_buffer_dirty - mark a buffer_head as needing writeout
+ * @bh: the buffer_head to mark dirty
+ *
+ * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
+ * backing page dirty, then tag the page as dirty in its address_space's radix
+ * tree and then attach the address_space's inode to its superblock's dirty
+ * inode list.
+ *
+ * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
+ * mapping->tree_lock and the global inode_lock.
+ */
+void mark_buffer_dirty(struct buffer_head *bh)
+{
+	__mark_buffer_dirty(bh, 1);
+}
 EXPORT_SYMBOL(mark_buffer_dirty);
 
+/**
+ * mark_buffer_dirty_noiotrack
+ * - same as mark_buffer_dirty but doesn't record dirtying process info
+ * @bh: the buffer_head to mark dirty
+ */
+void mark_buffer_dirty_noiotrack(struct buffer_head *bh)
+{
+	__mark_buffer_dirty(bh, 0);
+}
+EXPORT_SYMBOL(mark_buffer_dirty_noiotrack);
+
 /*
  * Decrement a buffer_head's reference count.  If all buffers against a page
  * have zero reference count, are clean and unlocked, and if the page is clean
@@ -1916,7 +1939,7 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
 }
 
 static int __block_commit_write(struct inode *inode, struct page *page,
-		unsigned from, unsigned to)
+		unsigned from, unsigned to, int track)
 {
 	unsigned block_start, block_end;
 	int partial = 0;
@@ -1934,7 +1957,10 @@ static int __block_commit_write(struct inode *inode, struct page *page,
 				partial = 1;
 		} else {
 			set_buffer_uptodate(bh);
-			mark_buffer_dirty(bh);
+			if (track)
+				mark_buffer_dirty(bh);
+			else
+				mark_buffer_dirty_noiotrack(bh);
 		}
 		clear_buffer_new(bh);
 	}
@@ -2067,7 +2093,7 @@ int block_write_end(struct file *file, struct address_space *mapping,
 	flush_dcache_page(page);
 
 	/* This could be a short (even 0-length) commit */
-	__block_commit_write(inode, page, start, start+copied);
+	__block_commit_write(inode, page, start, start+copied, 1);
 
 	return copied;
 }
@@ -2414,11 +2440,19 @@ EXPORT_SYMBOL(block_prepare_write);
 int block_commit_write(struct page *page, unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
-	__block_commit_write(inode,page,from,to);
+	__block_commit_write(inode, page, from, to, 1);
 	return 0;
 }
 EXPORT_SYMBOL(block_commit_write);
 
+int block_commit_write_noiotrack(struct page *page, unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	__block_commit_write(inode, page, from, to, 0);
+	return 0;
+}
+EXPORT_SYMBOL(block_commit_write_noiotrack);
+
 /*
  * block_page_mkwrite() is not allowed to change the file size as it gets
  * called from a page fault handler when a page is first dirtied. Hence we must
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 1b9ba19..9d7e0b0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -145,6 +145,7 @@ BUFFER_FNS(Unwritten, unwritten)
  */
 
 void mark_buffer_dirty(struct buffer_head *bh);
+void mark_buffer_dirty_noiotrack(struct buffer_head *bh);
 void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset);
@@ -225,6 +226,7 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
 			get_block_t *, loff_t *);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_commit_write_noiotrack(struct page *page, unsigned from, unsigned to);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
 void block_sync_page(struct page *);
-- 
1.6.2.5


* [RFC][PATCH 05/11] blkiocg async: __set_page_dirty_nobuffer not to record process info
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (3 preceding siblings ...)
  2010-07-09  3:16 ` [RFC][PATCH 04/11] blkiocg async: block_commit_write not to record process info Munehiro Ikeda
@ 2010-07-09  3:17 ` Munehiro Ikeda
  2010-07-09  3:17 ` [RFC][PATCH 06/11] blkiocg async: ext4_writepage not to overwrite iotrack info Munehiro Ikeda
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:17 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Andrea Righi, Gui Jianfeng, akpm, balbir

For the same reason that block_commit_write_noiotrack() was introduced
in the previous patch, this patch introduces
__set_page_dirty_nobuffers_noiotrack().
redirty_page_for_writepage() now calls
__set_page_dirty_nobuffers_noiotrack() because overwriting the iotrack
process info isn't needed when redirtying.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   55 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b969efb..08a957b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -851,6 +851,7 @@ extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
 
 int __set_page_dirty_nobuffers(struct page *page);
+int __set_page_dirty_nobuffers_noiotrack(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f3e6b2c..bdd6fdb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1099,22 +1099,11 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 	}
 }
 
-/*
- * For address_spaces which do not use buffers.  Just tag the page as dirty in
- * its radix tree.
- *
- * This is also used when a single buffer is being dirtied: we want to set the
- * page dirty in that case, but not all the buffers.  This is a "bottom-up"
- * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
- *
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+/**
+ * ____set_page_dirty_nobuffers - helper function for __set_page_dirty_nobuffers*
+ * If track is true, dirtying process info will be recorded for iotrack
  */
-int __set_page_dirty_nobuffers(struct page *page)
+static int ____set_page_dirty_nobuffers(struct page *page, int track)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
@@ -1129,7 +1118,9 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
-			blk_iotrack_reset_owner_pagedirty(page, current->mm);
+			if (track)
+				blk_iotrack_reset_owner_pagedirty(page,
+					current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
@@ -1142,8 +1133,38 @@ int __set_page_dirty_nobuffers(struct page *page)
 	}
 	return 0;
 }
+
+/*
+ * For address_spaces which do not use buffers.  Just tag the page as dirty in
+ * its radix tree.
+ *
+ * This is also used when a single buffer is being dirtied: we want to set the
+ * page dirty in that case, but not all the buffers.  This is a "bottom-up"
+ * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
+ *
+ * Most callers have locked the page, which pins the address_space in memory.
+ * But zap_pte_range() does not lock the page, however in that case the
+ * mapping is pinned by the vma's ->vm_file reference.
+ *
+ * We take care to handle the case where the page was truncated from the
+ * mapping by re-checking page_mapping() inside tree_lock.
+ */
+int __set_page_dirty_nobuffers(struct page *page)
+{
+	return ____set_page_dirty_nobuffers(page, 1);
+}
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
+/**
+ * Same as __set_page_dirty_nobuffers, but doesn't record process
+ * info for iotrack.
+ */
+int __set_page_dirty_nobuffers_noiotrack(struct page *page)
+{
+	return ____set_page_dirty_nobuffers(page, 0);
+}
+EXPORT_SYMBOL(__set_page_dirty_nobuffers_noiotrack);
+
 /*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
@@ -1152,7 +1173,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
-	return __set_page_dirty_nobuffers(page);
+	return __set_page_dirty_nobuffers_noiotrack(page);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
 
-- 
1.6.2.5


* [RFC][PATCH 06/11] blkiocg async: ext4_writepage not to overwrite iotrack info
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (4 preceding siblings ...)
  2010-07-09  3:17 ` [RFC][PATCH 05/11] blkiocg async: __set_page_dirty_nobuffer " Munehiro Ikeda
@ 2010-07-09  3:17 ` Munehiro Ikeda
  2010-07-09  3:18 ` [RFC][PATCH 07/11] blkiocg async: Pass bio to elevator_ops functions Munehiro Ikeda
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:17 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

This patch makes ext4 utilize iotrack.

ext4_writepage() calls block_commit_write() if the page doesn't have
buffers.  This overwrites the original iotrack info about who dirtied
the page.

To prevent this, this patch changes the block_commit_write() call in
ext4_writepage() into a block_commit_write_noiotrack() call.

ToDo:
Check whether any filesystem other than ext4 overwrites the info.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 fs/ext4/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..45a2d51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2737,7 +2737,7 @@ static int ext4_writepage(struct page *page,
 			return 0;
 		}
 		/* now mark the buffer_heads as dirty and uptodate */
-		block_commit_write(page, 0, len);
+		block_commit_write_noiotrack(page, 0, len);
 	}
 
 	if (PageChecked(page) && ext4_should_journal_data(inode)) {
-- 
1.6.2.5


* [RFC][PATCH 07/11] blkiocg async: Pass bio to elevator_ops functions
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (5 preceding siblings ...)
  2010-07-09  3:17 ` [RFC][PATCH 06/11] blkiocg async: ext4_writepage not to overwrite iotrack info Munehiro Ikeda
@ 2010-07-09  3:18 ` Munehiro Ikeda
  2010-07-09  3:19 ` [RFC][PATCH 08/11] blkiocg async: Function to search blkcg from css ID Munehiro Ikeda
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:18 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

The blkio controller (blkio cgroup) needs the information embedded in
struct page_cgroup to control async (cached) write IOs.  The
information is used to find or allocate the proper cfq_queue, which
should be done by the elevator_set_req_fn and elevator_may_queue_fn
methods of struct elevator_ops.

This patch adds a bio parameter to these methods in preparation for
the blkio controller's async write capability.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/blk-core.c         |    9 +++++----
 block/cfq-iosched.c      |    5 +++--
 block/elevator.c         |    9 +++++----
 include/linux/elevator.h |   10 ++++++----
 4 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f0640d7..2637a78 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -649,7 +649,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags,
+			int priv, gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -661,7 +662,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -752,7 +753,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
 
-	may_queue = elv_may_queue(q, rw_flags);
+	may_queue = elv_may_queue(q, bio, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
@@ -802,7 +803,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..cd4c0bd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3470,7 +3470,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 	return ELV_MQUEUE_MAY;
 }
 
-static int cfq_may_queue(struct request_queue *q, int rw)
+static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct task_struct *tsk = current;
@@ -3560,7 +3560,8 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+		gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
diff --git a/block/elevator.c b/block/elevator.c
index 923a913..62daff5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -787,12 +787,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -806,12 +807,12 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 		e->ops->elevator_put_req_fn(rq);
 }
 
-int elv_may_queue(struct request_queue *q, int rw)
+int elv_may_queue(struct request_queue *q, struct bio *bio, int rw)
 {
 	struct elevator_queue *e = q->elevator;
 
 	if (e->ops->elevator_may_queue_fn)
-		return e->ops->elevator_may_queue_fn(q, rw);
+		return e->ops->elevator_may_queue_fn(q, bio, rw);
 
 	return ELV_MQUEUE_MAY;
 }
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c958f4..0194e0a 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,9 +23,10 @@ typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_queue_empty_fn) (struct request_queue *);
 typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
-typedef int (elevator_may_queue_fn) (struct request_queue *, int);
+typedef int (elevator_may_queue_fn) (struct request_queue *, struct bio *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+					struct bio *, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -115,10 +116,11 @@ extern struct request *elv_former_request(struct request_queue *, struct request
 extern struct request *elv_latter_request(struct request_queue *, struct request *);
 extern int elv_register_queue(struct request_queue *q);
 extern void elv_unregister_queue(struct request_queue *q);
-extern int elv_may_queue(struct request_queue *, int);
+extern int elv_may_queue(struct request_queue *, struct bio *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+				struct bio *, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
-- 
1.6.2.5



* [RFC][PATCH 08/11] blkiocg async: Function to search blkcg from css ID
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (6 preceding siblings ...)
  2010-07-09  3:18 ` [RFC][PATCH 07/11] blkiocg async: Pass bio to elevator_ops functions Munehiro Ikeda
@ 2010-07-09  3:19 ` Munehiro Ikeda
  2010-07-09  3:20 ` [RFC][PATCH 09/11] blkiocg async: Functions to get cfqg from bio Munehiro Ikeda
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:19 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

iotrack stores the css ID of a blkcg in page_cgroup.  The blkcg is the
one that the process which dirtied the page belongs to.
This patch introduces a function to look up the blkcg from the recorded
ID.
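
Patch 09/11 uses this function roughly as follows (simplified from
cfq_get_cfqg_bio() there); the lookup must run under rcu_read_lock():

	unsigned short blkcg_id;
	struct blkio_cgroup *blkcg;
	struct cfq_group *cfqg;

	rcu_read_lock();
	blkcg_id = blk_iotrack_cgroup_id(bio);
	blkcg = blkiocg_id_to_blkcg(blkcg_id);
	if (blkcg)
		cfqg = cfq_get_cfqg_blkcg(cfqd, blkcg, create);
	else
		cfqg = &cfqd->root_group;	/* unknown or deleted cgroup */
	rcu_read_unlock();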

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/blk-cgroup.c |   13 +++++++++++++
 block/blk-cgroup.h |    1 +
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a680964..0756910 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -408,6 +408,19 @@ struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
 }
 EXPORT_SYMBOL_GPL(blkiocg_lookup_group);
 
+/* called under rcu_read_lock(). */
+struct blkio_cgroup *blkiocg_id_to_blkcg(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_lookup(&blkio_subsys, id);
+	if (!css)
+		return NULL;
+
+	return container_of(css, struct blkio_cgroup, css);
+}
+EXPORT_SYMBOL_GPL(blkiocg_id_to_blkcg);
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2b866ec..75c73bd 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -216,6 +216,7 @@ extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
+extern struct blkio_cgroup *blkiocg_id_to_blkcg(unsigned short id);
 void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 					unsigned long time);
 void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
-- 
1.6.2.5


* [RFC][PATCH 09/11] blkiocg async: Functions to get cfqg from bio
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (7 preceding siblings ...)
  2010-07-09  3:19 ` [RFC][PATCH 08/11] blkiocg async: Function to search blkcg from css ID Munehiro Ikeda
@ 2010-07-09  3:20 ` Munehiro Ikeda
  2010-07-09  3:22 ` [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group Munehiro Ikeda
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:20 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

Add functions to get a cfq_group from a bio.
This patch contains only the functionality to get the cfq_group from a
bio; cfq_get_cfqg() is always called with bio==NULL at this point and
returns the cfq_group of the current process, so CFQ's actual behavior
is not changed yet.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/cfq-iosched.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cd4c0bd..68f76e9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include <linux/blk-iotrack.h>
 #include "cfq.h"
 
 /*
@@ -1007,23 +1008,85 @@ done:
 }
 
 /*
- * Search for the cfq group current task belongs to. If create = 1, then also
- * create the cfq group if it does not exist. request_queue lock must be held.
+ * Body of searching for and allocating the cfq group corresponding
+ * to cfqd and the blkio cgroup.  If create = 1, then also create the
+ * cfq group if it does not exist. request_queue lock must be held.
  */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *
+cfq_get_cfqg_blkcg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
+			int create)
+{
+	struct cfq_group *cfqg = NULL;
+
+	cfqg = cfq_find_alloc_cfqg(cfqd, blkcg, create);
+	if (!cfqg && create)
+		cfqg = &cfqd->root_group;
+
+	return cfqg;
+}
+
+/*
+ * Search for the cfq group current task belongs to.
+ * If create = 1, then also create the cfq group if it does not
+ * exist.  request_queue lock must be held.
+ */
+static struct cfq_group *
+cfq_get_cfqg_current(struct cfq_data *cfqd, int create)
 {
 	struct cgroup *cgroup;
+	struct blkio_cgroup *blkcg;
 	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
+
 	cgroup = task_cgroup(current, blkio_subsys_id);
-	cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
-	if (!cfqg && create)
+	blkcg = cgroup_to_blkio_cgroup(cgroup);
+	cfqg = cfq_get_cfqg_blkcg(cfqd, blkcg, create);
+
+	rcu_read_unlock();
+
+	return cfqg;
+}
+
+/*
+ * Search for the cfq group corresponding to bio.
+ * If the bio is sync, the cfq group is the one which the current task
+ * belongs to.  If the cfq group doesn't exist, it is created.
+ * request_queue lock must be held.
+ */
+static struct cfq_group *
+cfq_get_cfqg_bio(struct cfq_data *cfqd, struct bio *bio, int create)
+{
+	unsigned short blkcg_id;
+	struct blkio_cgroup *blkcg;
+	struct cfq_group *cfqg;
+
+	if (!bio || cfq_bio_sync(bio))
+		return cfq_get_cfqg_current(cfqd, create);
+
+	rcu_read_lock();
+
+	blkcg_id = blk_iotrack_cgroup_id(bio);
+	blkcg = blkiocg_id_to_blkcg(blkcg_id);
+	if (blkcg)
+		cfqg = cfq_get_cfqg_blkcg(cfqd, blkcg, create);
+	else
 		cfqg = &cfqd->root_group;
+
 	rcu_read_unlock();
 	return cfqg;
 }
 
+static inline struct cfq_group *
+cfq_get_cfqg(struct cfq_data *cfqd, struct bio *bio, int create)
+{
+#ifdef CONFIG_GROUP_IOSCHED_ASYNC
+	return cfq_get_cfqg_bio(cfqd, bio, create);
+#else
+	return cfq_get_cfqg_current(cfqd, create);
+#endif
+}
+
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 {
 	atomic_inc(&cfqg->ref);
@@ -1109,7 +1172,8 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 }
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *
+cfq_get_cfqg(struct cfq_data *cfqd, struct bio *bio, int create)
 {
 	return &cfqd->root_group;
 }
@@ -2822,7 +2886,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 	struct cfq_group *cfqg;
 
 retry:
-	cfqg = cfq_get_cfqg(cfqd, 1);
+	cfqg = cfq_get_cfqg(cfqd, NULL, 1);
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (8 preceding siblings ...)
  2010-07-09  3:20 ` [RFC][PATCH 09/11] blkiocg async: Functions to get cfqg from bio Munehiro Ikeda
@ 2010-07-09  3:22 ` Munehiro Ikeda
  2010-08-13  1:24   ` Nauman Rafique
  2010-07-09  3:23 ` [RFC][PATCH 11/11] blkiocg async: Workload timeslice adjustment for async queues Munehiro Ikeda
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:22 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

This is the main body of the async write capability of the blkio
controller.  The basic ideas are
  - to move async queues from cfq_data to cfq_group, and
  - to associate cfq_io_context with cfq_group.

Each cfq_io_context, which used to be created per io_context
per device, is now created per io_context per cfq_group.
Since each cfq_group is created per cgroup per device, a
cfq_io_context now ends up being created per io_context per
cgroup per device.
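
In other words, the lookup chain becomes roughly:

  io_context --(radix_root[cfqg->cic_index])--> cfq_io_context
      --(cic->key)--> cfq_group --(cfqg->cfqd)--> cfq_data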

To protect the link between cfq_io_context and cfq_group,
the locking code is modified in several places.

ToDo:
- Lock protection might still be imperfect; more investigation
  is needed.

- cic_index of the root cfq_group (cfqd->root_group.cic_index) is not
  removed and persists across elevator switches.
  This issue should be solved.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/cfq-iosched.c       |  277 ++++++++++++++++++++++++++++-----------------
 include/linux/iocontext.h |    2 +-
 2 files changed, 175 insertions(+), 104 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 68f76e9..4186c30 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -191,10 +191,23 @@ struct cfq_group {
 	struct cfq_rb_root service_trees[2][3];
 	struct cfq_rb_root service_tree_idle;
 
+	/*
+	 * async queue for each priority case
+	 */
+	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
+	struct cfq_queue *async_idle_cfqq;
+
 	unsigned long saved_workload_slice;
 	enum wl_type_t saved_workload;
 	enum wl_prio_t saved_serving_prio;
 	struct blkio_group blkg;
+	struct cfq_data *cfqd;
+
+	/* lock for cic_list */
+	spinlock_t lock;
+	unsigned int cic_index;
+	struct list_head cic_list;
+
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	struct hlist_node cfqd_node;
 	atomic_t ref;
@@ -254,12 +267,6 @@ struct cfq_data {
 	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -275,8 +282,6 @@ struct cfq_data {
 	unsigned int cfq_latency;
 	unsigned int cfq_group_isolation;
 
-	unsigned int cic_index;
-	struct list_head cic_list;
 
 	/*
 	 * Fallback dummy cfqq for extreme OOM conditions
@@ -418,10 +423,16 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 }
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
+static void __cfq_exit_single_io_context(struct cfq_data *,
+					 struct cfq_group *,
+					 struct cfq_io_context *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
-				       struct io_context *, gfp_t);
+				       struct cfq_io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
-						struct io_context *);
+					     struct cfq_group *,
+					     struct io_context *);
+static void cfq_put_async_queues(struct cfq_group *);
+static int cfq_alloc_cic_index(void);
 
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    bool is_sync)
@@ -438,17 +449,28 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 #define CIC_DEAD_KEY	1ul
 #define CIC_DEAD_INDEX_SHIFT	1
 
-static inline void *cfqd_dead_key(struct cfq_data *cfqd)
+static inline void *cfqg_dead_key(struct cfq_group *cfqg)
 {
-	return (void *)(cfqd->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
+	return (void *)(cfqg->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
+}
+
+static inline struct cfq_group *cic_to_cfqg(struct cfq_io_context *cic)
+{
+	struct cfq_group *cfqg = cic->key;
+
+	if (unlikely((unsigned long) cfqg & CIC_DEAD_KEY))
+		cfqg = NULL;
+
+	return cfqg;
 }
 
 static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
 {
-	struct cfq_data *cfqd = cic->key;
+	struct cfq_data *cfqd = NULL;
+	struct cfq_group *cfqg = cic_to_cfqg(cic);
 
-	if (unlikely((unsigned long) cfqd & CIC_DEAD_KEY))
-		return NULL;
+	if (likely(cfqg))
+		cfqd =  cfqg->cfqd;
 
 	return cfqd;
 }
@@ -959,12 +981,12 @@ cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
 }
 
 static struct cfq_group *
-cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
+		    int create)
 {
-	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
 	struct cfq_group *cfqg = NULL;
 	void *key = cfqd;
-	int i, j;
+	int i, j, idx;
 	struct cfq_rb_root *st;
 	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	unsigned int major, minor;
@@ -978,14 +1000,21 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 	if (cfqg || !create)
 		goto done;
 
+	idx = cfq_alloc_cic_index();
+	if (idx < 0)
+		goto done;
+
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
 	if (!cfqg)
-		goto done;
+		goto err;
 
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
 	RB_CLEAR_NODE(&cfqg->rb_node);
 
+	spin_lock_init(&cfqg->lock);
+	INIT_LIST_HEAD(&cfqg->cic_list);
+
 	/*
 	 * Take the initial reference that will be released on destroy
 	 * This can be thought of a joint reference by cgroup and
@@ -1002,7 +1031,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 
 	/* Add group on cfqd list */
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+	cfqg->cfqd = cfqd;
+	cfqg->cic_index = idx;
+	goto done;
 
+err:
+	spin_lock(&cic_index_lock);
+	ida_remove(&cic_index_ida, idx);
+	spin_unlock(&cic_index_lock);
 done:
 	return cfqg;
 }
@@ -1095,10 +1131,6 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 
 static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
 {
-	/* Currently, all async queues are mapped to root group */
-	if (!cfq_cfqq_sync(cfqq))
-		cfqg = &cfqq->cfqd->root_group;
-
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
 	atomic_inc(&cfqq->cfqg->ref);
@@ -1114,6 +1146,16 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
+
+	/*
+	 * ToDo:
+	 * root_group never reaches here, so its cic_index is never
+	 * removed; the unused cic_index persists across elevator switches.
+	 */
+	spin_lock(&cic_index_lock);
+	ida_remove(&cic_index_ida, cfqg->cic_index);
+	spin_unlock(&cic_index_lock);
+
 	kfree(cfqg);
 }
 
@@ -1122,6 +1164,15 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	/* Something wrong if we are trying to remove same group twice */
 	BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
 
+	while (!list_empty(&cfqg->cic_list)) {
+		struct cfq_io_context *cic = list_entry(cfqg->cic_list.next,
+							struct cfq_io_context,
+							group_list);
+
+		__cfq_exit_single_io_context(cfqd, cfqg, cic);
+	}
+
+	cfq_put_async_queues(cfqg);
 	hlist_del_init(&cfqg->cfqd_node);
 
 	/*
@@ -1497,10 +1548,12 @@ static struct request *
 cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 {
 	struct task_struct *tsk = current;
+	struct cfq_group *cfqg;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
 
-	cic = cfq_cic_lookup(cfqd, tsk->io_context);
+	cfqg = cfq_get_cfqg(cfqd, bio, 0);
+	cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
 	if (!cic)
 		return NULL;
 
@@ -1611,6 +1664,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_group *cfqg;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
 
@@ -1624,7 +1678,8 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	 * Lookup the cfqq that this bio will be queued with. Allow
 	 * merge only if rq is queued there.
 	 */
-	cic = cfq_cic_lookup(cfqd, current->io_context);
+	cfqg = cfq_get_cfqg(cfqd, bio, 0);
+	cic = cfq_cic_lookup(cfqd, cfqg, current->io_context);
 	if (!cic)
 		return false;
 
@@ -2667,17 +2722,18 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 }
 
 static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
+					 struct cfq_group *cfqg,
 					 struct cfq_io_context *cic)
 {
 	struct io_context *ioc = cic->ioc;
 
-	list_del_init(&cic->queue_list);
+	list_del_init(&cic->group_list);
 
 	/*
 	 * Make sure dead mark is seen for dead queues
 	 */
 	smp_wmb();
-	cic->key = cfqd_dead_key(cfqd);
+	cic->key = cfqg_dead_key(cfqg);
 
 	if (ioc->ioc_data == cic)
 		rcu_assign_pointer(ioc->ioc_data, NULL);
@@ -2696,23 +2752,23 @@ static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
 static void cfq_exit_single_io_context(struct io_context *ioc,
 				       struct cfq_io_context *cic)
 {
-	struct cfq_data *cfqd = cic_to_cfqd(cic);
+	struct cfq_group *cfqg = cic_to_cfqg(cic);
 
-	if (cfqd) {
-		struct request_queue *q = cfqd->queue;
+	if (cfqg) {
+		struct cfq_data *cfqd = cfqg->cfqd;
 		unsigned long flags;
 
-		spin_lock_irqsave(q->queue_lock, flags);
+		spin_lock_irqsave(&cfqg->lock, flags);
 
 		/*
 		 * Ensure we get a fresh copy of the ->key to prevent
 		 * race between exiting task and queue
 		 */
 		smp_read_barrier_depends();
-		if (cic->key == cfqd)
-			__cfq_exit_single_io_context(cfqd, cic);
+		if (cic->key == cfqg)
+			__cfq_exit_single_io_context(cfqd, cfqg, cic);
 
-		spin_unlock_irqrestore(q->queue_lock, flags);
+		spin_unlock_irqrestore(&cfqg->lock, flags);
 	}
 }
 
@@ -2734,7 +2790,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 							cfqd->queue->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
-		INIT_LIST_HEAD(&cic->queue_list);
+		INIT_LIST_HEAD(&cic->group_list);
 		INIT_HLIST_NODE(&cic->cic_list);
 		cic->dtor = cfq_free_io_context;
 		cic->exit = cfq_exit_io_context;
@@ -2801,8 +2857,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
-						GFP_ATOMIC);
+		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic, GFP_ATOMIC);
 		if (new_cfqq) {
 			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
 			cfq_put_queue(cfqq);
@@ -2879,16 +2934,14 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+		     struct cfq_io_context *cic, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
-	struct cfq_io_context *cic;
+	struct io_context *ioc = cic->ioc;
 	struct cfq_group *cfqg;
 
 retry:
-	cfqg = cfq_get_cfqg(cfqd, NULL, 1);
-	cic = cfq_cic_lookup(cfqd, ioc);
-	/* cic always exists here */
+	cfqg = cic_to_cfqg(cic);
 	cfqq = cic_to_cfqq(cic, is_sync);
 
 	/*
@@ -2930,36 +2983,38 @@ retry:
 }
 
 static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
+		return &cfqg->async_cfqq[0][ioprio];
 	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
+		return &cfqg->async_cfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
+		return &cfqg->async_idle_cfqq;
 	default:
 		BUG();
 	}
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
+cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_context *cic,
 	      gfp_t gfp_mask)
 {
+	struct io_context *ioc = cic->ioc;
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
+	struct cfq_group *cfqg = cic_to_cfqg(cic);
 	struct cfq_queue **async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
+		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
 		cfqq = *async_cfqq;
 	}
 
 	if (!cfqq)
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, gfp_mask);
 
 	/*
 	 * pin the queue now that it's allocated, scheduler exit will prune it
@@ -2977,19 +3032,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
  * We drop cfq io contexts lazily, so we may find a dead one.
  */
 static void
-cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
-		  struct cfq_io_context *cic)
+cfq_drop_dead_cic(struct cfq_data *cfqd, struct cfq_group *cfqg,
+		  struct io_context *ioc, struct cfq_io_context *cic)
 {
 	unsigned long flags;
 
-	WARN_ON(!list_empty(&cic->queue_list));
-	BUG_ON(cic->key != cfqd_dead_key(cfqd));
+	WARN_ON(!list_empty(&cic->group_list));
+	BUG_ON(cic->key != cfqg_dead_key(cfqg));
 
 	spin_lock_irqsave(&ioc->lock, flags);
 
 	BUG_ON(ioc->ioc_data == cic);
 
-	radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
+	radix_tree_delete(&ioc->radix_root, cfqg->cic_index);
 	hlist_del_rcu(&cic->cic_list);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 
@@ -2997,11 +3052,14 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
 }
 
 static struct cfq_io_context *
-cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
+cfq_cic_lookup(struct cfq_data *cfqd, struct cfq_group *cfqg,
+	       struct io_context *ioc)
 {
 	struct cfq_io_context *cic;
 	unsigned long flags;
 
+	if (!cfqg)
+		return NULL;
 	if (unlikely(!ioc))
 		return NULL;
 
@@ -3011,18 +3069,18 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	 * we maintain a last-hit cache, to avoid browsing over the tree
 	 */
 	cic = rcu_dereference(ioc->ioc_data);
-	if (cic && cic->key == cfqd) {
+	if (cic && cic->key == cfqg) {
 		rcu_read_unlock();
 		return cic;
 	}
 
 	do {
-		cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
+		cic = radix_tree_lookup(&ioc->radix_root, cfqg->cic_index);
 		rcu_read_unlock();
 		if (!cic)
 			break;
-		if (unlikely(cic->key != cfqd)) {
-			cfq_drop_dead_cic(cfqd, ioc, cic);
+		if (unlikely(cic->key != cfqg)) {
+			cfq_drop_dead_cic(cfqd, cfqg, ioc, cic);
 			rcu_read_lock();
 			continue;
 		}
@@ -3037,24 +3095,25 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 }
 
 /*
- * Add cic into ioc, using cfqd as the search key. This enables us to lookup
+ * Add cic into ioc, using cfqg as the search key. This enables us to lookup
  * the process specific cfq io context when entered from the block layer.
- * Also adds the cic to a per-cfqd list, used when this queue is removed.
+ * Also adds the cic to a per-cfqg list, used when the group is removed.
+ * request_queue lock must be held.
  */
-static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
-			struct cfq_io_context *cic, gfp_t gfp_mask)
+static int cfq_cic_link(struct cfq_data *cfqd, struct cfq_group *cfqg,
+			struct io_context *ioc, struct cfq_io_context *cic)
 {
 	unsigned long flags;
 	int ret;
 
-	ret = radix_tree_preload(gfp_mask);
+	ret = radix_tree_preload(GFP_ATOMIC);
 	if (!ret) {
 		cic->ioc = ioc;
-		cic->key = cfqd;
+		cic->key = cfqg;
 
 		spin_lock_irqsave(&ioc->lock, flags);
 		ret = radix_tree_insert(&ioc->radix_root,
-						cfqd->cic_index, cic);
+						cfqg->cic_index, cic);
 		if (!ret)
 			hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
 		spin_unlock_irqrestore(&ioc->lock, flags);
@@ -3062,9 +3121,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(&cfqg->lock, flags);
+			list_add(&cic->group_list, &cfqg->cic_list);
+			spin_unlock_irqrestore(&cfqg->lock, flags);
 		}
 	}
 
@@ -3080,10 +3139,12 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
  * than one device managed by cfq.
  */
 static struct cfq_io_context *
-cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
+cfq_get_io_context(struct cfq_data *cfqd, struct bio *bio, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct cfq_group *cfqg, *cfqg2;
+	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -3091,18 +3152,38 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 	if (!ioc)
 		return NULL;
 
-	cic = cfq_cic_lookup(cfqd, ioc);
+	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+retry_cfqg:
+	cfqg = cfq_get_cfqg(cfqd, bio, 1);
+retry_cic:
+	cic = cfq_cic_lookup(cfqd, cfqg, ioc);
 	if (cic)
 		goto out;
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 
 	cic = cfq_alloc_io_context(cfqd, gfp_mask);
 	if (cic == NULL)
 		goto err;
 
-	if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
+	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+	/* check the consistency breakage during unlock period */
+	cfqg2 = cfq_get_cfqg(cfqd, bio, 0);
+	if (cfqg != cfqg2) {
+		cfq_cic_free(cic);
+		if (!cfqg2)
+			goto retry_cfqg;
+		else {
+			cfqg = cfqg2;
+			goto retry_cic;
+		}
+	}
+
+	if (cfq_cic_link(cfqd, cfqg, ioc, cic))
 		goto err_free;
 
 out:
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
@@ -3113,6 +3194,7 @@ out:
 #endif
 	return cic;
 err_free:
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 	cfq_cic_free(cic);
 err:
 	put_io_context(ioc);
@@ -3537,6 +3619,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_group *cfqg;
 	struct task_struct *tsk = current;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
@@ -3547,7 +3630,8 @@ static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
 	 * so just lookup a possibly existing queue, or return 'may queue'
 	 * if that fails
 	 */
-	cic = cfq_cic_lookup(cfqd, tsk->io_context);
+	cfqg = cfq_get_cfqg(cfqd, bio, 0);
+	cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
 	if (!cic)
 		return ELV_MQUEUE_MAY;
 
@@ -3636,7 +3720,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	cic = cfq_get_io_context(cfqd, gfp_mask);
+	cic = cfq_get_io_context(cfqd, bio, gfp_mask);
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
@@ -3646,7 +3730,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, is_sync, cic, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	} else {
 		/*
@@ -3762,19 +3846,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
 	cancel_work_sync(&cfqd->unplug_work);
 }
 
-static void cfq_put_async_queues(struct cfq_data *cfqd)
+static void cfq_put_async_queues(struct cfq_group *cfqg)
 {
 	int i;
 
 	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
+		if (cfqg->async_cfqq[0][i])
+			cfq_put_queue(cfqg->async_cfqq[0][i]);
+		if (cfqg->async_cfqq[1][i])
+			cfq_put_queue(cfqg->async_cfqq[1][i]);
 	}
 
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
+	if (cfqg->async_idle_cfqq)
+		cfq_put_queue(cfqg->async_idle_cfqq);
 }
 
 static void cfq_cfqd_free(struct rcu_head *head)
@@ -3794,15 +3878,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	if (cfqd->active_queue)
 		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
-	while (!list_empty(&cfqd->cic_list)) {
-		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
-							struct cfq_io_context,
-							queue_list);
-
-		__cfq_exit_single_io_context(cfqd, cic);
-	}
-
-	cfq_put_async_queues(cfqd);
 	cfq_release_cfq_groups(cfqd);
 	cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
 
@@ -3810,10 +3885,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 
 	cfq_shutdown_timer_wq(cfqd);
 
-	spin_lock(&cic_index_lock);
-	ida_remove(&cic_index_ida, cfqd->cic_index);
-	spin_unlock(&cic_index_lock);
-
 	/* Wait for cfqg->blkg->key accessors to exit their grace periods. */
 	call_rcu(&cfqd->rcu, cfq_cfqd_free);
 }
@@ -3823,7 +3894,7 @@ static int cfq_alloc_cic_index(void)
 	int index, error;
 
 	do {
-		if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
+		if (!ida_pre_get(&cic_index_ida, GFP_ATOMIC))
 			return -ENOMEM;
 
 		spin_lock(&cic_index_lock);
@@ -3839,20 +3910,18 @@ static int cfq_alloc_cic_index(void)
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i, j;
+	int i, j, idx;
 	struct cfq_group *cfqg;
 	struct cfq_rb_root *st;
 
-	i = cfq_alloc_cic_index();
-	if (i < 0)
+	idx = cfq_alloc_cic_index();
+	if (idx < 0)
 		return NULL;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->cic_index = i;
-
 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
 
@@ -3861,6 +3930,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
 	RB_CLEAR_NODE(&cfqg->rb_node);
+	cfqg->cfqd = cfqd;
+	cfqg->cic_index = idx;
+	INIT_LIST_HEAD(&cfqg->cic_list);
 
 	/* Give preference to root group over other groups */
 	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
@@ -3874,6 +3946,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	rcu_read_lock();
 	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
 					(void *)cfqd, 0);
+	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 	rcu_read_unlock();
 #endif
 	/*
@@ -3893,8 +3966,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	atomic_inc(&cfqd->oom_cfqq.ref);
 	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
-	INIT_LIST_HEAD(&cfqd->cic_list);
-
 	cfqd->queue = q;
 
 	init_timer(&cfqd->idle_slice_timer);
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 64d5291..6c05b54 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -18,7 +18,7 @@ struct cfq_io_context {
 	unsigned long ttime_samples;
 	unsigned long ttime_mean;
 
-	struct list_head queue_list;
+	struct list_head group_list;
 	struct hlist_node cic_list;
 
 	void (*dtor)(struct io_context *); /* destructor */
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC][PATCH 11/11] blkiocg async: Workload timeslice adjustment for async queues
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (9 preceding siblings ...)
  2010-07-09  3:22 ` [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group Munehiro Ikeda
@ 2010-07-09  3:23 ` Munehiro Ikeda
  2010-07-09 10:04 ` [RFC][PATCH 00/11] blkiocg async support Andrea Righi
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09  3:23 UTC (permalink / raw)
  To: linux-kernel, jens.axboe, Vivek Goyal
  Cc: Munehiro Ikeda, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

Async queues are no longer system-wide.  The workload timeslice was
calculated based on the assumption that async queues are system
wide, so this patch adjusts the calculation.
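
With CONFIG_GROUP_IOSCHED_ASYNC enabled, a group's async workload
slice is now simply scaled down by the async/sync slice ratio;
assuming the default tunables (cfq_slice_async = HZ/25 = 40ms,
cfq_slice_sync = HZ/10 = 100ms), a 100ms workload slice becomes
100 * 40 / 100 = 40ms.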

This is the only modification to the queue scheduling algorithm
made by this patch series.

ToDo:
To investigate if more tuning is needed for non-system-wide
async queues.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/cfq-iosched.c |   56 ++++++++++++++++++++++++++++++++++----------------
 1 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4186c30..f930dfd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2165,6 +2165,41 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 	return cur_best;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED_ASYNC
+static unsigned int adjust_async_slice(struct cfq_data *cfqd,
+				struct cfq_group *cfqg, unsigned int slice)
+{
+	/* Just scaling down according to the sync/async slice ratio
+	 * if async queues are not system wide. */
+	return slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
+}
+
+#else /* CONFIG_GROUP_IOSCHED_ASYNC */
+
+static unsigned int adjust_async_slice(struct cfq_data *cfqd,
+				struct cfq_group *cfqg, unsigned int slice)
+{
+	unsigned int new_slice;
+
+	/*
+	 * If async queues are system wide, just taking
+	 * proportion of queues within the same group will lead to a higher
+	 * async ratio system wide as generally the root group is going
+	 * to have higher weight. A more accurate thing would be to
+	 * calculate a system wide async/sync ratio.
+	 */
+	new_slice = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
+	new_slice = new_slice/cfqd->busy_queues;
+	new_slice = min_t(unsigned, slice, new_slice);
+
+	/* async workload slice is scaled down according to
+	 * the sync/async slice ratio. */
+	new_slice = new_slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
+
+	return new_slice;
+}
+#endif /* CONFIG_GROUP_IOSCHED_ASYNC */
+
 static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	unsigned slice;
@@ -2220,24 +2255,9 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
 		      cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
 
-	if (cfqd->serving_type == ASYNC_WORKLOAD) {
-		unsigned int tmp;
-
-		/*
-		 * Async queues are currently system wide. Just taking
-		 * proportion of queues with-in same group will lead to higher
-		 * async ratio system wide as generally root group is going
-		 * to have higher weight. A more accurate thing would be to
-		 * calculate system wide asnc/sync ratio.
-		 */
-		tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
-		tmp = tmp/cfqd->busy_queues;
-		slice = min_t(unsigned, slice, tmp);
-
-		/* async workload slice is scaled down according to
-		 * the sync/async slice ratio. */
-		slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
-	} else
+	if (cfqd->serving_type == ASYNC_WORKLOAD)
+		slice = adjust_async_slice(cfqd, cfqg, slice);
+	else
 		/* sync workload slice is at least 2 * cfq_slice_idle */
 		slice = max(slice, 2 * cfqd->cfq_slice_idle);
 
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09  3:15 ` [RFC][PATCH 02/11] blkiocg async: The main part of iotrack Munehiro Ikeda
@ 2010-07-09  7:35   ` KAMEZAWA Hiroyuki
  2010-07-09 23:06     ` Munehiro Ikeda
  2010-07-09  7:38   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-09  7:35 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Thu, 08 Jul 2010 23:15:30 -0400
Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:

> iotrack is a functionality to record who dirtied the
> page.  This is needed for block IO controller cgroup
> to support async (cached) write.
> 
> This patch is based on a patch posted from Ryo Tsuruta
> on Oct 2, 2009 titled as "The body of blkio-cgroup".
> The patch added a new member on struct page_cgroup to
> record cgroup ID, but this was given a negative opinion
> from Kame, a maintainer of memory controller cgroup,
> because this bloats the size of struct page_cgroup.
> 
> Instead, this patch takes an approach proposed by
> Andrea Righi, which records cgroup ID in flags
> of struct page_cgroup with bit encoding.
> 
> ToDo:
> Cgroup ID of deleted cgroup will be recycled.  Further
> consideration is needed.
> 
> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
>  block/Kconfig.iosched       |    8 +++
>  block/Makefile              |    1 +
>  block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-iotrack.h |   62 +++++++++++++++++++++
>  include/linux/page_cgroup.h |   25 ++++++++
>  init/Kconfig                |    2 +-
>  mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
>  7 files changed, 309 insertions(+), 9 deletions(-)
>  create mode 100644 block/blk-iotrack.c
>  create mode 100644 include/linux/blk-iotrack.h
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 3199b76..3ab712d 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>  	---help---
>  	  Enable group IO scheduling in CFQ.
>  
> +config GROUP_IOSCHED_ASYNC
> +	bool "CFQ Group Scheduling for async IOs (EXPERIMENTAL)"
> +	depends on CFQ_GROUP_IOSCHED && EXPERIMENTAL
> +	select MM_OWNER
> +	default n
> +	help
> +	  Enable group IO scheduling for async IOs.
> +
>  choice
>  	prompt "Default I/O scheduler"
>  	default DEFAULT_CFQ
> diff --git a/block/Makefile b/block/Makefile
> index 0bb499a..441858d 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>  
>  obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
>  obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
> +obj-$(CONFIG_GROUP_IOSCHED_ASYNC) += blk-iotrack.o
>  obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> diff --git a/block/blk-iotrack.c b/block/blk-iotrack.c
> new file mode 100644
> index 0000000..d98a09a
> --- /dev/null
> +++ b/block/blk-iotrack.c
> @@ -0,0 +1,129 @@
> +/* blk-iotrack.c - Block I/O Tracking
> + *
> + * Copyright (C) VA Linux Systems Japan, 2008-2009
> + * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
> + *
> + * Copyright (C) 2010 Munehiro Ikeda <m-ikeda@ds.jp.nec.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mm_inline.h>
> +#include <linux/rcupdate.h>
> +#include <linux/module.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-iotrack.h>
> +#include "blk-cgroup.h"
> +
> +/*
> + * The block I/O tracking mechanism is implemented on the cgroup memory
> + * controller framework. It helps to find the the owner of an I/O request
> + * because every I/O request has a target page and the owner of the page
> + * can be easily determined on the framework.
> + */
> +
> +/* Return the blkio_cgroup that associates with a process. */
> +static inline struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *p)
> +{
> +	return cgroup_to_blkio_cgroup(task_cgroup(p, blkio_subsys_id));
> +}
> +
> +/**
> + * blk_iotrack_set_owner() - set the owner ID of a page.
> + * @page:	the page we want to tag
> + * @mm:		the mm_struct of a page owner
> + *
> + * Make a given page have the blkio-cgroup ID of the owner of this page.
> + */
> +int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
> +{
> +	struct blkio_cgroup *blkcg;
> +	unsigned short id = 0;	/* 0: default blkio_cgroup id */
> +
> +	if (blk_iotrack_disabled())
> +		return 0;
> +	if (!mm)
> +		goto out;
> +
> +	rcu_read_lock();
> +	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
> +	if (likely(blkcg))
> +		id = css_id(&blkcg->css);
> +	rcu_read_unlock();
> +out:
> +	return page_cgroup_set_owner(page, id);
> +}
> +
> +/**
> + * blk_iotrack_reset_owner() - reset the owner ID of a page
> + * @page:	the page we want to tag
> + * @mm:		the mm_struct of a page owner
> + *
> + * Change the owner of a given page if necessary.
> + */
> +int blk_iotrack_reset_owner(struct page *page, struct mm_struct *mm)
> +{
> +	return blk_iotrack_set_owner(page, mm);
> +}
> +
> +/**
> + * blk_iotrack_reset_owner_pagedirty() - reset the owner ID of a pagecache page
> + * @page:	the page we want to tag
> + * @mm:		the mm_struct of a page owner
> + *
> + * Change the owner of a given page if the page is in the pagecache.
> + */
> +int blk_iotrack_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> +{
> +	if (!page_is_file_cache(page))
> +		return 0;
> +	if (current->flags & PF_MEMALLOC)
> +		return 0;
> +
> +	return blk_iotrack_reset_owner(page, mm);
> +}
> +
> +/**
> + * blk_iotrack_copy_owner() - copy the owner ID of a page into another page
> + * @npage:	the page where we want to copy the owner
> + * @opage:	the page from which we want to copy the ID
> + *
> + * Copy the owner ID of @opage into @npage.
> + */
> +int blk_iotrack_copy_owner(struct page *npage, struct page *opage)
> +{
> +	if (blk_iotrack_disabled())
> +		return 0;
> +	return page_cgroup_copy_owner(npage, opage);
> +}
> +
> +/**
> + * blk_iotrack_cgroup_id() - determine the blkio-cgroup ID
> + * @bio:	the &struct bio which describes the I/O
> + *
> + * Returns the blkio-cgroup ID of a given bio. A return value zero
> + * means that the page associated with the bio belongs to root blkio_cgroup.
> + */
> +unsigned long blk_iotrack_cgroup_id(struct bio *bio)
> +{
> +	struct page *page;
> +
> +	if (!bio->bi_vcnt)
> +		return 0;
> +
> +	page = bio_iovec_idx(bio, 0)->bv_page;
> +	return page_cgroup_get_owner(page);
> +}
> +EXPORT_SYMBOL(blk_iotrack_cgroup_id);
> +
> diff --git a/include/linux/blk-iotrack.h b/include/linux/blk-iotrack.h
> new file mode 100644
> index 0000000..8021c2b
> --- /dev/null
> +++ b/include/linux/blk-iotrack.h
> @@ -0,0 +1,62 @@
> +#include <linux/cgroup.h>
> +#include <linux/mm.h>
> +#include <linux/page_cgroup.h>
> +
> +#ifndef _LINUX_BLK_IOTRACK_H
> +#define _LINUX_BLK_IOTRACK_H
> +
> +#ifdef CONFIG_GROUP_IOSCHED_ASYNC
> +
> +/**
> + * blk_iotrack_disabled() - check whether block IO tracking is disabled
> + * Returns true if disabled, false if not.
> + */
> +static inline bool blk_iotrack_disabled(void)
> +{
> +	if (blkio_subsys.disabled)
> +		return true;
> +	return false;
> +}
> +
> +extern int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm);
> +extern int blk_iotrack_reset_owner(struct page *page, struct mm_struct *mm);
> +extern int blk_iotrack_reset_owner_pagedirty(struct page *page,
> +						 struct mm_struct *mm);
> +extern int blk_iotrack_copy_owner(struct page *page, struct page *opage);
> +extern unsigned long blk_iotrack_cgroup_id(struct bio *bio);
> +
> +#else /* !CONFIG_GROUP_IOSCHED_ASYNC */
> +
> +static inline bool blk_iotrack_disabled(void)
> +{
> +	return true;
> +}
> +
> +static inline int blk_iotrack_set_owner(struct page *page,
> +						struct mm_struct *mm)
> +{
> +}
> +
> +static inline int blk_iotrack_reset_owner(struct page *page,
> +						struct mm_struct *mm)
> +{
> +}
> +
> +static inline int blk_iotrack_reset_owner_pagedirty(struct page *page,
> +						struct mm_struct *mm)
> +{
> +}
> +
> +static inline int blk_iotrack_copy_owner(struct page *page,
> +						struct page *opage)
> +{
> +}
> +
> +static inline unsigned long blk_iotrack_cgroup_id(struct bio *bio)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_GROUP_IOSCHED_ASYNC */
> +
> +#endif /* _LINUX_BLK_IOTRACK_H */
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 6a21b0d..473b79a 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -17,6 +17,31 @@ struct page_cgroup {
>  	struct list_head lru;		/* per cgroup LRU list */
>  };
>  
> +/*
> + * use lower 16 bits for flags and reserve the rest for the page tracking id
> + */
> +#define PAGE_TRACKING_ID_SHIFT	(16)
> +#define PAGE_TRACKING_ID_BITS \
> +		(8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
> +
> +/* NOTE: must be called with page_cgroup() lock held */
> +static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
> +{
> +	return pc->flags >> PAGE_TRACKING_ID_SHIFT;
> +}
> +
> +/* NOTE: must be called with page_cgroup() lock held */
> +static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
> +{
> +	WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
> +	pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
> +	pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
> +}

I think this is the first wall.
Because ->flags is a bit field, there are some places which set/clear bits
without locks (please see mem_cgroup_del/add_lru()).

So I can't say this field operation is safe even if it's always done
under locks.  Hmm, can this be done with cmpxchg?
IIUC, there is a generic version now, even if it's heavy.
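
For example, something along these lines (untested sketch, reusing the
PAGE_TRACKING_ID_SHIFT macro from this patch):

static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
{
	unsigned long old, new;

	do {
		old = pc->flags;
		/* keep the low flag bits, replace only the ID part */
		new = (old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1)) |
		      (id << PAGE_TRACKING_ID_SHIFT);
	} while (cmpxchg(&pc->flags, old, new) != old);
}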





> +
> +unsigned long page_cgroup_get_owner(struct page *page);
> +int page_cgroup_set_owner(struct page *page, unsigned long id);
> +int page_cgroup_copy_owner(struct page *npage, struct page *opage);
> +
>  void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
>  
>  #ifdef CONFIG_SPARSEMEM
> diff --git a/init/Kconfig b/init/Kconfig
> index 2e40f2f..337ee01 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -650,7 +650,7 @@ endif # CGROUPS
>  
>  config CGROUP_PAGE
>  	def_bool y
> -	depends on CGROUP_MEM_RES_CTLR
> +	depends on CGROUP_MEM_RES_CTLR || GROUP_IOSCHED_ASYNC
>  
>  config MM_OWNER
>  	bool
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 6c00814..69e080c 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -9,6 +9,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/cgroup.h>
>  #include <linux/swapops.h>
> +#include <linux/blk-iotrack.h>
>  
>  static void __meminit
>  __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> @@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
>  
>  	int nid, fail;
>  
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() && blk_iotrack_disabled())
>  		return;
>  
>  	for_each_online_node(nid)  {
> @@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
>  			goto fail;
>  	}
>  	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
> -	" don't want memory cgroups\n");
> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
> +	" if you don't want memory and blkio cgroups\n");
>  	return;
>  fail:
>  	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
> +	printk(KERN_CRIT
> +		"please try 'cgroup_disable=memory,blkio' boot option\n");
>  	panic("Out of memory");
>  }

Hmm, iotrack is always done if blkio-cgroup is used, right?

Then why do you have an extra config option, CONFIG_GROUP_IOSCHED_ASYNC?
Is it necessary?

>  
> @@ -251,7 +253,7 @@ void __init page_cgroup_init(void)
>  	unsigned long pfn;
>  	int fail = 0;
>  
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() && blk_iotrack_disabled())
>  		return;
>  
>  	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> @@ -260,14 +262,15 @@ void __init page_cgroup_init(void)
>  		fail = init_section_page_cgroup(pfn);
>  	}
>  	if (fail) {
> -		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> +		printk(KERN_CRIT
> +			"try 'cgroup_disable=memory,blkio' boot option\n");
>  		panic("Out of memory");
>  	} else {
>  		hotplug_memory_notifier(page_cgroup_callback, 0);
>  	}
>  	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
> -	" want memory cgroups\n");
> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
> +	" if you don't want memory and blkio cgroups\n");
>  }
>  
>  void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> @@ -277,6 +280,78 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>  
>  #endif
>  
> +/**
> + * page_cgroup_get_owner() - get the owner ID of a page
> + * @page:	the page we want to find the owner
> + *
> + * Returns the owner ID of the page, 0 means that the owner cannot be
> + * retrieved.
> + **/
> +unsigned long page_cgroup_get_owner(struct page *page)
> +{
> +	struct page_cgroup *pc;
> +	unsigned long ret;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (unlikely(!pc))
> +		return 0;
> +
> +	lock_page_cgroup(pc);
> +	ret = page_cgroup_get_id(pc);
> +	unlock_page_cgroup(pc);
> +	return ret;
> +}

Is this lock required?  Even if a wrong ID is read because of a race, it just
means this I/O is charged to another cgroup.
I don't think it's an issue.  Considering that the user can move tasks while
generating I/O without any interaction with the I/O cgroup, the issue itself
is invisible and cannot be handled.  I'd prefer lightweight code here.
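
I.e. for the reader, something like this should be enough (untested):

unsigned long page_cgroup_get_owner(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* a racy read only risks charging the I/O to another cgroup */
	return pc ? page_cgroup_get_id(pc) : 0;
}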





> +
> +/**
> + * page_cgroup_set_owner() - set the owner ID of a page
> + * @page:	the page we want to tag
> + * @id:		the ID number that will be associated to page
> + *
> + * Returns 0 if the owner is correctly associated to the page. Returns a
> + * negative value in case of failure.
> + **/
> +int page_cgroup_set_owner(struct page *page, unsigned long id)
> +{
> +	struct page_cgroup *pc;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (unlikely(!pc))
> +		return -ENOENT;
> +
> +	lock_page_cgroup(pc);
> +	page_cgroup_set_id(pc, id);
> +	unlock_page_cgroup(pc);
> +	return 0;
> +}
> +

Shouldn't this function check whether it is overwriting an existing owner?




> +/**
> + * page_cgroup_copy_owner() - copy the owner ID of a page into another page
> + * @npage:	the page where we want to copy the owner
> + * @opage:	the page from which we want to copy the ID
> + *
> + * Returns 0 if the owner is correctly associated to npage. Returns a negative
> + * value in case of failure.
> + **/
> +int page_cgroup_copy_owner(struct page *npage, struct page *opage)
> +{
> +	struct page_cgroup *npc, *opc;
> +	unsigned long id;
> +
> +	npc = lookup_page_cgroup(npage);
> +	if (unlikely(!npc))
> +		return -ENOENT;
> +	opc = lookup_page_cgroup(opage);
> +	if (unlikely(!opc))
> +		return -ENOENT;
> +	lock_page_cgroup(opc);
> +	lock_page_cgroup(npc);
> +	id = page_cgroup_get_id(opc);
> +	page_cgroup_set_id(npc, id);
> +	unlock_page_cgroup(npc);
> +	unlock_page_cgroup(opc);
> +
> +	return 0;
> +}

You can remove the lock here, too (use cmpxchg for the writer).

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09  3:15 ` [RFC][PATCH 02/11] blkiocg async: The main part of iotrack Munehiro Ikeda
  2010-07-09  7:35   ` KAMEZAWA Hiroyuki
@ 2010-07-09  7:38   ` KAMEZAWA Hiroyuki
  2010-07-09 23:09     ` Munehiro Ikeda
  1 sibling, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-09  7:38 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Thu, 08 Jul 2010 23:15:30 -0400
Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:

> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
>  block/Kconfig.iosched       |    8 +++
>  block/Makefile              |    1 +
>  block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-iotrack.h |   62 +++++++++++++++++++++
>  include/linux/page_cgroup.h |   25 ++++++++
>  init/Kconfig                |    2 +-
>  mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
>  7 files changed, 309 insertions(+), 9 deletions(-)
>  create mode 100644 block/blk-iotrack.c
>  create mode 100644 include/linux/blk-iotrack.h
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 3199b76..3ab712d 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>  	---help---
>  	  Enable group IO scheduling in CFQ.
>  
> +config GROUP_IOSCHED_ASYNC
> +	bool "CFQ Group Scheduling for async IOs (EXPERIMENTAL)"
> +	depends on CFQ_GROUP_IOSCHED && EXPERIMENTAL
> +	select MM_OWNER
> +	default n
> +	help
> +	  Enable group IO scheduling for async IOs.
> +
>  choice
>  	prompt "Default I/O scheduler"
>  	default DEFAULT_CFQ
> diff --git a/block/Makefile b/block/Makefile
> index 0bb499a..441858d 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>  
>  obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
>  obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
> +obj-$(CONFIG_GROUP_IOSCHED_ASYNC) += blk-iotrack.o
>  obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> diff --git a/block/blk-iotrack.c b/block/blk-iotrack.c
> new file mode 100644
> index 0000000..d98a09a
> --- /dev/null
> +++ b/block/blk-iotrack.c
> @@ -0,0 +1,129 @@
> +/* blk-iotrack.c - Block I/O Tracking
> + *
> + * Copyright (C) VA Linux Systems Japan, 2008-2009
> + * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
> + *
> + * Copyright (C) 2010 Munehiro Ikeda <m-ikeda@ds.jp.nec.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mm_inline.h>
> +#include <linux/rcupdate.h>
> +#include <linux/module.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-iotrack.h>
> +#include "blk-cgroup.h"
> +
> +/*
> + * The block I/O tracking mechanism is implemented on the cgroup memory
> + * controller framework. It helps to find the owner of an I/O request
> + * because every I/O request has a target page and the owner of the page
> + * can be easily determined on the framework.
> + */
> +
> +/* Return the blkio_cgroup that associates with a process. */
> +static inline struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *p)
> +{
> +	return cgroup_to_blkio_cgroup(task_cgroup(p, blkio_subsys_id));
> +}
> +
> +/**
> + * blk_iotrack_set_owner() - set the owner ID of a page.
> + * @page:	the page we want to tag
> + * @mm:		the mm_struct of a page owner
> + *
> + * Make a given page have the blkio-cgroup ID of the owner of this page.
> + */
> +int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
> +{
> +	struct blkio_cgroup *blkcg;
> +	unsigned short id = 0;	/* 0: default blkio_cgroup id */
> +
> +	if (blk_iotrack_disabled())
> +		return 0;
> +	if (!mm)
> +		goto out;
> +
> +	rcu_read_lock();
> +	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
> +	if (likely(blkcg))
> +		id = css_id(&blkcg->css);
> +	rcu_read_unlock();
> +out:
> +	return page_cgroup_set_owner(page, id);
> +}
> +

I think this is bad.  I/O should be charged against threads, i.e. "current",
because CFQ/blkio-cgroup provides per-thread control.
mm->owner is not suitable, I think.
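
I.e. roughly something like this in blk_iotrack_set_owner() (untested):

	rcu_read_lock();
	blkcg = task_to_blkio_cgroup(current);
	if (likely(blkcg))
		id = css_id(&blkcg->css);
	rcu_read_unlock();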

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack
  2010-07-09  3:16 ` [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack Munehiro Ikeda
@ 2010-07-09  9:24   ` Andrea Righi
  2010-07-09 23:43     ` Munehiro Ikeda
  0 siblings, 1 reply; 53+ messages in thread
From: Andrea Righi @ 2010-07-09  9:24 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Gui Jianfeng, akpm, balbir

On Thu, Jul 08, 2010 at 11:16:28PM -0400, Munehiro Ikeda wrote:
> Embedding hooks for iotrack to record process info, namely
> cgroup ID.
> This patch is based on a patch posted from Ryo Tsuruta on Oct 2,
> 2009 titled "Page tracking hooks".
> 
> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
>  fs/buffer.c         |    2 ++
>  fs/direct-io.c      |    2 ++
>  mm/bounce.c         |    2 ++
>  mm/filemap.c        |    2 ++
>  mm/memory.c         |    5 +++++
>  mm/page-writeback.c |    2 ++
>  mm/swap_state.c     |    2 ++
>  7 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d54812b..c418fdf 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -36,6 +36,7 @@
>  #include <linux/buffer_head.h>
>  #include <linux/task_io_accounting_ops.h>
>  #include <linux/bio.h>
> +#include <linux/blk-iotrack.h>
>  #include <linux/notifier.h>
>  #include <linux/cpu.h>
>  #include <linux/bitops.h>
> @@ -667,6 +668,7 @@ static void __set_page_dirty(struct page *page,
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
>  		account_page_dirtied(page, mapping);
> +		blk_iotrack_reset_owner_pagedirty(page, current->mm);
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  	}
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 7600aac..2c1f42f 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -33,6 +33,7 @@
>  #include <linux/err.h>
>  #include <linux/blkdev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/blk-iotrack.h>
>  #include <linux/rwsem.h>
>  #include <linux/uio.h>
>  #include <asm/atomic.h>
> @@ -846,6 +847,7 @@ static int do_direct_IO(struct dio *dio)
>  			ret = PTR_ERR(page);
>  			goto out;
>  		}
> +		blk_iotrack_reset_owner(page, current->mm);
>  
>  		while (block_in_page < blocks_per_page) {
>  			unsigned offset_in_page = block_in_page << blkbits;
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 13b6dad..04339df 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/hash.h>
>  #include <linux/highmem.h>
> +#include <linux/blk-iotrack.h>
>  #include <asm/tlbflush.h>
>  
>  #include <trace/events/block.h>
> @@ -211,6 +212,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
>  		to->bv_len = from->bv_len;
>  		to->bv_offset = from->bv_offset;
>  		inc_zone_page_state(to->bv_page, NR_BOUNCE);
> +		blk_iotrack_copy_owner(to->bv_page, page);
>  
>  		if (rw == WRITE) {
>  			char *vto, *vfrom;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 20e5642..a255d0c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
>  #include <linux/cpuset.h>
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
> +#include <linux/blk-iotrack.h>
>  #include <linux/mm_inline.h> /* for page_is_file_cache() */
>  #include "internal.h"
>  
> @@ -405,6 +406,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  					gfp_mask & GFP_RECLAIM_MASK);
>  	if (error)
>  		goto out;
> +	blk_iotrack_set_owner(page, current->mm);
>  
>  	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>  	if (error == 0) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 119b7cc..3eb2d0d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -52,6 +52,7 @@
>  #include <linux/init.h>
>  #include <linux/writeback.h>
>  #include <linux/memcontrol.h>
> +#include <linux/blk-iotrack.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/kallsyms.h>
>  #include <linux/swapops.h>
> @@ -2283,6 +2284,7 @@ gotten:
>  		 */
>  		ptep_clear_flush(vma, address, page_table);
>  		page_add_new_anon_rmap(new_page, vma, address);
> +		blk_iotrack_set_owner(new_page, mm);
>  		/*
>  		 * We call the notify macro here because, when using secondary
>  		 * mmu page tables (such as kvm shadow page tables), we want the
> @@ -2718,6 +2720,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> +	blk_iotrack_reset_owner(page, mm);
>  	/* It's better to call commit-charge after rmap is established */
>  	mem_cgroup_commit_charge_swapin(page, ptr);
>  
> @@ -2795,6 +2798,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, address);
> +	blk_iotrack_set_owner(page, mm);

We're tracking anonymous memory here.  Should we charge the cost of the
swap IO to the root cgroup or to the actual owner of the page?  There was
a previous discussion about this topic, but I can't remember the details,
sorry.

IMHO we could remove some overhead by simply ignoring the tracking of
anonymous pages (swap IO) and only considering the direct and writeback
IO of file pages.
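
E.g., just a sketch, something like the following guard at the top of
blk_iotrack_set_owner() (or simply dropping the hooks in the anonymous
fault and swap-in paths):

	/* track only file-backed pages; ignore anon/swap I/O */
	if (!page_is_file_cache(page))
		return 0;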

>  setpte:
>  	set_pte_at(mm, address, page_table, entry);
>  
> @@ -2949,6 +2953,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (anon) {
>  			inc_mm_counter_fast(mm, MM_ANONPAGES);
>  			page_add_new_anon_rmap(page, vma, address);
> +			blk_iotrack_set_owner(page, mm);

Ditto.

>  		} else {
>  			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 54f28bd..f3e6b2c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -23,6 +23,7 @@
>  #include <linux/init.h>
>  #include <linux/backing-dev.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include <linux/blk-iotrack.h>
>  #include <linux/blkdev.h>
>  #include <linux/mpage.h>
>  #include <linux/rmap.h>
> @@ -1128,6 +1129,7 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			BUG_ON(mapping2 != mapping);
>  			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
>  			account_page_dirtied(page, mapping);
> +			blk_iotrack_reset_owner_pagedirty(page, current->mm);
>  			radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  		}
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e10f583..ab26978 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -19,6 +19,7 @@
>  #include <linux/pagevec.h>
>  #include <linux/migrate.h>
>  #include <linux/page_cgroup.h>
> +#include <linux/blk-iotrack.h>
>  
>  #include <asm/pgtable.h>
>  
> @@ -324,6 +325,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
>  		__set_page_locked(new_page);
>  		SetPageSwapBacked(new_page);
> +		blk_iotrack_set_owner(new_page, current->mm);

Ditto.

>  		err = __add_to_swap_cache(new_page, entry);
>  		if (likely(!err)) {
>  			radix_tree_preload_end();
> -- 
> 1.6.2.5

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (10 preceding siblings ...)
  2010-07-09  3:23 ` [RFC][PATCH 11/11] blkiocg async: Workload timeslice adjustment for async queues Munehiro Ikeda
@ 2010-07-09 10:04 ` Andrea Righi
  2010-07-09 13:45 ` Vivek Goyal
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 53+ messages in thread
From: Andrea Righi @ 2010-07-09 10:04 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Gui Jianfeng, akpm, balbir, Greg Thelen

On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> This might be a piece of puzzle for complete async write support of blkio
> controller.  One of other pieces in my head is page dirtying ratio control.
> I believe Andrea Righi was working on it...how about the situation?

Greg Thelen (cc'ed) made some progress on my original work. AFAIK
there are still some locking issues to resolve, principally because
cgroup dirty memory accounting requires lock_page_cgroup() to be
irq-safe. I did some tests using the irq-safe locking vs. trylock
approach and Greg also tested the RCU way.

The RCU approach seems promising IMHO, because the page's cgroup owner
is supposed to change rarely (except for files shared and frequently
written by many cgroups).

Greg, do you have a patch rebased to a recent kernel?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (11 preceding siblings ...)
  2010-07-09 10:04 ` [RFC][PATCH 00/11] blkiocg async support Andrea Righi
@ 2010-07-09 13:45 ` Vivek Goyal
  2010-07-10  0:17   ` Munehiro Ikeda
  2010-07-26  6:41 ` Balbir Singh
  2010-08-02 20:58 ` Vivek Goyal
  14 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-09 13:45 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> These RFC patches are trial to add async (cached) write support on blkio
> controller.
> 
> Only test which has been done is to compile, boot, and that write bandwidth
> seems prioritized when pages which were dirtied by two different processes in
> different cgroups are written back to a device simultaneously.  I know this
> is the minimum (or less) test but I posted this as RFC because I would like
> to hear your opinions about the design direction in the early stage.
> 
> Patches are for 2.6.35-rc4.
> 
> This patch series consists of two chunks.
> 
> (1) iotrack (patch 01/11 -- 06/11)
> 
> This is a functionality to track who dirtied a page, in exact which cgroup a
> process which dirtied a page belongs to.  Blkio controller will read the info
> later and prioritize when the page is actually written to a block device.
> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
> proposals for IO controller.
> 
> 
> (2) blkio controller modification (07/11 -- 11/11)
> 
> The main part of blkio controller async write support.
> Currently async queues are device-wide and async write IOs are always treated
> as root group.
> These patches make async queues per a cfq_group per a device to control them.
> Async write is handled by flush kernel thread.  Because queue pointers are
> stored in cfq_io_context, io_context of the thread has to have multiple
> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
> io_context per a cfq_group, which means per an io_context per a cgroup per a
> device.
> 
> 
> This might be a piece of puzzle for complete async write support of blkio
> controller.  One of other pieces in my head is page dirtying ratio control.
> I believe Andrea Righi was working on it...how about the situation?

Thanks Muuh. I will look into the patches in detail.

In my initial patches I had implemented support for ASYNC control
(they also included Ryo's IO tracking patches), but it did not work well
and it was unpredictable. I realized that until and unless we implement
some kind of per-group dirty ratio/page cache share at the VM level and
create parallel paths for ASYNC IO, writes often get serialized.

So writes belonging to a high priority group get stuck behind a low
priority group and you don't get any service differentiation.

So IMHO, this piece should go into the kernel after we have first fixed
the problem at the VM level (read: memory controller) with some kind of
per-cgroup dirty ratio.

> 
> And also, I'm thinking that async write support is required by bandwidth
> capping policy of blkio controller.  Bandwidth capping can be done in upper
> layer than elevator.

I think the capping facility should be implemented in higher layers,
otherwise it is not useful for higher-level logical devices (dm/md).

It was ok to implement proportional bandwidth division at the CFQ level
because one can do proportional BW division at each leaf node and still get
overall service differentiation at the higher-level logical node. But the
same cannot be done for max BW control.
 
>  However I think it should be also done in elevator layer
> in my opinion.  Elevator buffers and sort requests.  If there is another
> buffering functionality in upper layer, it is doubled buffering and it can be
> harmful for elevator's prediction.

I don't mind doing it at the elevator layer also, because in the case where
somebody is not using dm/md, one does not have to load a max bw control
module and can simply enable max bw control in CFQ.

Thinking more about it, now we are suggesting implementing max BW
control in two places. I think it will be duplication of code and
increased complexity in CFQ. Probably implement max bw control with
the help of a dm module and use the same for CFQ also. There is pain
associated with configuring a dm device, but I guess it is easier than
maintaining two max bw control schemes in the kernel.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09  7:35   ` KAMEZAWA Hiroyuki
@ 2010-07-09 23:06     ` Munehiro Ikeda
  2010-07-12  0:11       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09 23:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, Andrea Righi,
	Gui Jianfeng, akpm, balbir

KAMEZAWA Hiroyuki wrote, on 07/09/2010 03:35 AM:
> On Thu, 08 Jul 2010 23:15:30 -0400
> Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
>
>> iotrack is a functionality to record who dirtied the
>> page.  This is needed for block IO controller cgroup
>> to support async (cached) write.
>>
>> This patch is based on a patch posted from Ryo Tsuruta
>> on Oct 2, 2009 titled as "The body of blkio-cgroup".
>> The patch added a new member on struct page_cgroup to
>> record cgroup ID, but this was given a negative opinion
>> from Kame, a maintainer of memory controller cgroup,
>> because this bloats the size of struct page_cgroup.
>>
>> Instead, this patch takes an approach proposed by
>> Andrea Righi, which records cgroup ID in flags
>> of struct page_cgroup with bit encoding.
>>
>> ToDo:
>> Cgroup ID of deleted cgroup will be recycled.  Further
>> consideration is needed.
>>
>> Signed-off-by: Hirokazu Takahashi<taka@valinux.co.jp>
>> Signed-off-by: Ryo Tsuruta<ryov@valinux.co.jp>
>> Signed-off-by: Andrea Righi<arighi@develer.com>
>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>> ---
>>   block/Kconfig.iosched       |    8 +++
>>   block/Makefile              |    1 +
>>   block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/blk-iotrack.h |   62 +++++++++++++++++++++
>>   include/linux/page_cgroup.h |   25 ++++++++
>>   init/Kconfig                |    2 +-
>>   mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
>>   7 files changed, 309 insertions(+), 9 deletions(-)
>>   create mode 100644 block/blk-iotrack.c
>>   create mode 100644 include/linux/blk-iotrack.h

(snip, snip)

>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index 6a21b0d..473b79a 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -17,6 +17,31 @@ struct page_cgroup {
>>   	struct list_head lru;		/* per cgroup LRU list */
>>   };
>>
>> +/*
>> + * use lower 16 bits for flags and reserve the rest for the page tracking id
>> + */
>> +#define PAGE_TRACKING_ID_SHIFT	(16)
>> +#define PAGE_TRACKING_ID_BITS \
>> +		(8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
>> +
>> +/* NOTE: must be called with page_cgroup() lock held */
>> +static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
>> +{
>> +	return pc->flags >> PAGE_TRACKING_ID_SHIFT;
>> +}
>> +
>> +/* NOTE: must be called with page_cgroup() lock held */
>> +static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
>> +{
>> +	WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
>> +	pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
>> +	pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
>> +}
>
> I think that this is the 1st wall.
> Because ->flags is a "bit field", there are some places that set/clear bits
> without locks (please see mem_cgroup_del/add_lru()).
>
> So I can't say this field operation is safe even if it's always done
> under locks. Hmm. Can this be done with cmpxchg?

OK, we can do it like:

do {
	old =  pc->flags;
	new = old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1);
	new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
} while (cmpxchg(&pc->flags, old, new)  != old);
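
For illustration only (not part of the posted patches), that loop could be
wrapped into a helper next to the existing page_cgroup_set_id(), assuming
the same PAGE_TRACKING_ID_SHIFT layout; the helper name is made up here:

static inline void page_cgroup_set_id_cmpxchg(struct page_cgroup *pc,
					      unsigned long id)
{
	unsigned long old, new;

	/* lockless update: retry if another update of pc->flags races with us */
	do {
		old = pc->flags;
		new = old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1);
		new |= id << PAGE_TRACKING_ID_SHIFT;
	} while (cmpxchg(&pc->flags, old, new) != old);
}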


> IIUC, there is a generic version now even if it's heavy.

I couldn't find it...is there one already?  Or do you mean we should
have a generic one?


>> +unsigned long page_cgroup_get_owner(struct page *page);
>> +int page_cgroup_set_owner(struct page *page, unsigned long id);
>> +int page_cgroup_copy_owner(struct page *npage, struct page *opage);
>> +
>>   void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
>>
>>   #ifdef CONFIG_SPARSEMEM
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 2e40f2f..337ee01 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -650,7 +650,7 @@ endif # CGROUPS
>>
>>   config CGROUP_PAGE
>>   	def_bool y
>> -	depends on CGROUP_MEM_RES_CTLR
>> +	depends on CGROUP_MEM_RES_CTLR || GROUP_IOSCHED_ASYNC
>>
>>   config MM_OWNER
>>   	bool
>> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
>> index 6c00814..69e080c 100644
>> --- a/mm/page_cgroup.c
>> +++ b/mm/page_cgroup.c
>> @@ -9,6 +9,7 @@
>>   #include<linux/vmalloc.h>
>>   #include<linux/cgroup.h>
>>   #include<linux/swapops.h>
>> +#include<linux/blk-iotrack.h>
>>
>>   static void __meminit
>>   __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
>> @@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
>>
>>   	int nid, fail;
>>
>> -	if (mem_cgroup_disabled())
>> +	if (mem_cgroup_disabled() && blk_iotrack_disabled())
>>   		return;
>>
>>   	for_each_online_node(nid)  {
>> @@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
>>   			goto fail;
>>   	}
>>   	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
>> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
>> -	" don't want memory cgroups\n");
>> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
>> +	" if you don't want memory and blkio cgroups\n");
>>   	return;
>>   fail:
>>   	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
>> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
>> +	printk(KERN_CRIT
>> +		"please try 'cgroup_disable=memory,blkio' boot option\n");
>>   	panic("Out of memory");
>>   }
>
> Hmm, io-track is always done if blkio-cgroup is used, right?

No, iotrack is enabled only when CONFIG_GROUP_IOSCHED_ASYNC=y.
If =n, iotrack is disabled even if blkio-cgroup is enabled.


> Then why do you have an extra config option, CONFIG_GROUP_IOSCHED_ASYNC?
> Is it necessary?

The current purpose of the option is only debugging, because this is
experimental functionality.
It can be removed once this work is completed, or a dynamic switch
might be useful.

In fact, just "cgroup_disable=memory" is enough for the failure
case.  Let me think about the right messages.


>> @@ -251,7 +253,7 @@ void __init page_cgroup_init(void)
>>   	unsigned long pfn;
>>   	int fail = 0;
>>
>> -	if (mem_cgroup_disabled())
>> +	if (mem_cgroup_disabled() && blk_iotrack_disabled())
>>   		return;
>>
>>   	for (pfn = 0; !fail&&  pfn<  max_pfn; pfn += PAGES_PER_SECTION) {
>> @@ -260,14 +262,15 @@ void __init page_cgroup_init(void)
>>   		fail = init_section_page_cgroup(pfn);
>>   	}
>>   	if (fail) {
>> -		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
>> +		printk(KERN_CRIT
>> +			"try 'cgroup_disable=memory,blkio' boot option\n");
>>   		panic("Out of memory");
>>   	} else {
>>   		hotplug_memory_notifier(page_cgroup_callback, 0);
>>   	}
>>   	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
>> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
>> -	" want memory cgroups\n");
>> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
>> +	" if you don't want memory and blkio cgroups\n");
>>   }
>>
>>   void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>> @@ -277,6 +280,78 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>>
>>   #endif
>>
>> +/**
>> + * page_cgroup_get_owner() - get the owner ID of a page
>> + * @page:	the page we want to find the owner
>> + *
>> + * Returns the owner ID of the page, 0 means that the owner cannot be
>> + * retrieved.
>> + **/
>> +unsigned long page_cgroup_get_owner(struct page *page)
>> +{
>> +	struct page_cgroup *pc;
>> +	unsigned long ret;
>> +
>> +	pc = lookup_page_cgroup(page);
>> +	if (unlikely(!pc))
>> +		return 0;
>> +
>> +	lock_page_cgroup(pc);
>> +	ret = page_cgroup_get_id(pc);
>> +	unlock_page_cgroup(pc);
>> +	return ret;
>> +}
>
> Is this lock required? Even if a wrong ID is read because of a race, it just
> means this I/O is charged to another cgroup.
> I don't think it's an issue. Considering the user can move tasks while generating
> I/O without any interaction with the I/O cgroup, the issue itself is invisible and
> cannot be handled. I love light weight here.

I agree.
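
A lockless reader would then be roughly the sketch below (illustration only,
reusing lookup_page_cgroup() and page_cgroup_get_id() from the posted patch;
a racy read only risks charging the IO to a neighboring cgroup, as you note):

unsigned long page_cgroup_get_owner(struct page *page)
{
	struct page_cgroup *pc;

	pc = lookup_page_cgroup(page);
	if (unlikely(!pc))
		return 0;

	/* no lock_page_cgroup(): a stale ID only misattributes this IO */
	return page_cgroup_get_id(pc);
}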


>> +
>> +/**
>> + * page_cgroup_set_owner() - set the owner ID of a page
>> + * @page:	the page we want to tag
>> + * @id:		the ID number that will be associated to page
>> + *
>> + * Returns 0 if the owner is correctly associated to the page. Returns a
>> + * negative value in case of failure.
>> + **/
>> +int page_cgroup_set_owner(struct page *page, unsigned long id)
>> +{
>> +	struct page_cgroup *pc;
>> +
>> +	pc = lookup_page_cgroup(page);
>> +	if (unlikely(!pc))
>> +		return -ENOENT;
>> +
>> +	lock_page_cgroup(pc);
>> +	page_cgroup_set_id(pc, id);
>> +	unlock_page_cgroup(pc);
>> +	return 0;
>> +}
>> +
>
> Doesn't this function check for overwrites?

No, it doesn't.  This function is called only when the info needs
to be overwritten, if I'm correct.
If you have another concern, please correct me.


>> +/**
>> + * page_cgroup_copy_owner() - copy the owner ID of a page into another page
>> + * @npage:	the page where we want to copy the owner
>> + * @opage:	the page from which we want to copy the ID
>> + *
>> + * Returns 0 if the owner is correctly associated to npage. Returns a negative
>> + * value in case of failure.
>> + **/
>> +int page_cgroup_copy_owner(struct page *npage, struct page *opage)
>> +{
>> +	struct page_cgroup *npc, *opc;
>> +	unsigned long id;
>> +
>> +	npc = lookup_page_cgroup(npage);
>> +	if (unlikely(!npc))
>> +		return -ENOENT;
>> +	opc = lookup_page_cgroup(opage);
>> +	if (unlikely(!opc))
>> +		return -ENOENT;
>> +	lock_page_cgroup(opc);
>> +	lock_page_cgroup(npc);
>> +	id = page_cgroup_get_id(opc);
>> +	page_cgroup_set_id(npc, id);
>> +	unlock_page_cgroup(npc);
>> +	unlock_page_cgroup(opc);
>> +
>> +	return 0;
>> +}
>
> You can remove the lock here, too (use cmpxchg for the writer).

OK.
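
For illustration, a sketch of the lockless copy (not the posted code;
page_cgroup_set_id_cmpxchg() is the hypothetical cmpxchg-based setter
sketched earlier in this mail):

int page_cgroup_copy_owner(struct page *npage, struct page *opage)
{
	struct page_cgroup *npc, *opc;

	npc = lookup_page_cgroup(npage);
	if (unlikely(!npc))
		return -ENOENT;
	opc = lookup_page_cgroup(opage);
	if (unlikely(!opc))
		return -ENOENT;

	/* racy read is acceptable; cmpxchg keeps the flag bits consistent */
	page_cgroup_set_id_cmpxchg(npc, page_cgroup_get_id(opc));
	return 0;
}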


>
> Thanks,
> -Kame


Thank you so much!
Muuhh


-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09  7:38   ` KAMEZAWA Hiroyuki
@ 2010-07-09 23:09     ` Munehiro Ikeda
  2010-07-10 10:06       ` Andrea Righi
  0 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09 23:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, Andrea Righi,
	Gui Jianfeng, akpm, balbir, Muuhh Ikeda

KAMEZAWA Hiroyuki wrote, on 07/09/2010 03:38 AM:
> On Thu, 08 Jul 2010 23:15:30 -0400
> Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
>
>> Signed-off-by: Hirokazu Takahashi<taka@valinux.co.jp>
>> Signed-off-by: Ryo Tsuruta<ryov@valinux.co.jp>
>> Signed-off-by: Andrea Righi<arighi@develer.com>
>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>> ---
>>   block/Kconfig.iosched       |    8 +++
>>   block/Makefile              |    1 +
>>   block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/blk-iotrack.h |   62 +++++++++++++++++++++
>>   include/linux/page_cgroup.h |   25 ++++++++
>>   init/Kconfig                |    2 +-
>>   mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
>>   7 files changed, 309 insertions(+), 9 deletions(-)
>>   create mode 100644 block/blk-iotrack.c
>>   create mode 100644 include/linux/blk-iotrack.h

(snip)

>> +/**
>> + * blk_iotrack_set_owner() - set the owner ID of a page.
>> + * @page:	the page we want to tag
>> + * @mm:		the mm_struct of a page owner
>> + *
>> + * Make a given page have the blkio-cgroup ID of the owner of this page.
>> + */
>> +int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
>> +{
>> +	struct blkio_cgroup *blkcg;
>> +	unsigned short id = 0;	/* 0: default blkio_cgroup id */
>> +
>> +	if (blk_iotrack_disabled())
>> +		return 0;
>> +	if (!mm)
>> +		goto out;
>> +
>> +	rcu_read_lock();
>> +	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
>> +	if (likely(blkcg))
>> +		id = css_id(&blkcg->css);
>> +	rcu_read_unlock();
>> +out:
>> +	return page_cgroup_set_owner(page, id);
>> +}
>> +
>
> I think this is bad. I/O should be charged against threads, i.e. "current",
> because CFQ/blockio-cgroup provides per-thread control.
> mm->owner is not suitable, I think.

OK, thanks.
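
For illustration, a sketch of that change (not the posted code; it reuses the
names from the patch and charges against "current" instead of mm->owner, so
the mm argument would likely become unnecessary):

int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
{
	struct blkio_cgroup *blkcg;
	unsigned short id = 0;	/* 0: default blkio_cgroup id */

	if (blk_iotrack_disabled())
		return 0;

	/* mm is unused once we charge "current"; kept to match the prototype */
	rcu_read_lock();
	blkcg = task_to_blkio_cgroup(current);
	if (likely(blkcg))
		id = css_id(&blkcg->css);
	rcu_read_unlock();

	return page_cgroup_set_owner(page, id);
}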


-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack
  2010-07-09  9:24   ` Andrea Righi
@ 2010-07-09 23:43     ` Munehiro Ikeda
  0 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-09 23:43 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Gui Jianfeng, akpm, balbir, Muuhh Ikeda

Andrea Righi wrote, on 07/09/2010 05:24 AM:
> On Thu, Jul 08, 2010 at 11:16:28PM -0400, Munehiro Ikeda wrote:
>> Embedding hooks for iotrack to record process info, namely
>> cgroup ID.
>> This patch is based on a patch posted from Ryo Tsuruta on Oct 2,
>> 2009 titled "Page tracking hooks".
>>
>> Signed-off-by: Hirokazu Takahashi<taka@valinux.co.jp>
>> Signed-off-by: Ryo Tsuruta<ryov@valinux.co.jp>
>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>> ---
>>   fs/buffer.c         |    2 ++
>>   fs/direct-io.c      |    2 ++
>>   mm/bounce.c         |    2 ++
>>   mm/filemap.c        |    2 ++
>>   mm/memory.c         |    5 +++++
>>   mm/page-writeback.c |    2 ++
>>   mm/swap_state.c     |    2 ++
>>   7 files changed, 17 insertions(+), 0 deletions(-)
>>

(snip)

>> diff --git a/mm/memory.c b/mm/memory.c
>> index 119b7cc..3eb2d0d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -52,6 +52,7 @@
>>   #include<linux/init.h>
>>   #include<linux/writeback.h>
>>   #include<linux/memcontrol.h>
>> +#include<linux/blk-iotrack.h>
>>   #include<linux/mmu_notifier.h>
>>   #include<linux/kallsyms.h>
>>   #include<linux/swapops.h>
>> @@ -2283,6 +2284,7 @@ gotten:
>>   		 */
>>   		ptep_clear_flush(vma, address, page_table);
>>   		page_add_new_anon_rmap(new_page, vma, address);
>> +		blk_iotrack_set_owner(new_page, mm);
>>   		/*
>>   		 * We call the notify macro here because, when using secondary
>>   		 * mmu page tables (such as kvm shadow page tables), we want the
>> @@ -2718,6 +2720,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>   	flush_icache_page(vma, page);
>>   	set_pte_at(mm, address, page_table, pte);
>>   	page_add_anon_rmap(page, vma, address);
>> +	blk_iotrack_reset_owner(page, mm);
>>   	/* It's better to call commit-charge after rmap is established */
>>   	mem_cgroup_commit_charge_swapin(page, ptr);
>>
>> @@ -2795,6 +2798,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>
>>   	inc_mm_counter_fast(mm, MM_ANONPAGES);
>>   	page_add_new_anon_rmap(page, vma, address);
>> +	blk_iotrack_set_owner(page, mm);
>
> We're tracking anonymous memory here. Should we charge the cost of the
> swap IO to the root cgroup or the actual owner of the page? there was a
> previous discussion about this topic, but can't remember the details,
> sorry.
>
> IMHO we could remove some overhead simply ignoring the tracking of
> anonymous pages (swap IO) and just consider direct and writeback IO of
> file pages.

Well, this needs a decision.  Actually, there are multiple options for
who should be charged for swap IO:

(1) the root cgroup
(2) the page owner
(3) the memory hogger who triggered the swap

The choice of this patch is (2).  However, I agree that there is no
concrete rationale to select (2) over (3), and (1) is the simplest
and the best for performance.
Alright, I will remove anonymous pages from the tracking targets.


Thanks,
Muuhh



>>   setpte:
>>   	set_pte_at(mm, address, page_table, entry);
>>
>> @@ -2949,6 +2953,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   		if (anon) {
>>   			inc_mm_counter_fast(mm, MM_ANONPAGES);
>>   			page_add_new_anon_rmap(page, vma, address);
>> +			blk_iotrack_set_owner(page, mm);
>
> Ditto.
>
>>   		} else {
>>   			inc_mm_counter_fast(mm, MM_FILEPAGES);
>>   			page_add_file_rmap(page);
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 54f28bd..f3e6b2c 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -23,6 +23,7 @@
>>   #include<linux/init.h>
>>   #include<linux/backing-dev.h>
>>   #include<linux/task_io_accounting_ops.h>
>> +#include<linux/blk-iotrack.h>
>>   #include<linux/blkdev.h>
>>   #include<linux/mpage.h>
>>   #include<linux/rmap.h>
>> @@ -1128,6 +1129,7 @@ int __set_page_dirty_nobuffers(struct page *page)
>>   			BUG_ON(mapping2 != mapping);
>>   			WARN_ON_ONCE(!PagePrivate(page)&&  !PageUptodate(page));
>>   			account_page_dirtied(page, mapping);
>> +			blk_iotrack_reset_owner_pagedirty(page, current->mm);
>>   			radix_tree_tag_set(&mapping->page_tree,
>>   				page_index(page), PAGECACHE_TAG_DIRTY);
>>   		}
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index e10f583..ab26978 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -19,6 +19,7 @@
>>   #include<linux/pagevec.h>
>>   #include<linux/migrate.h>
>>   #include<linux/page_cgroup.h>
>> +#include<linux/blk-iotrack.h>
>>
>>   #include<asm/pgtable.h>
>>
>> @@ -324,6 +325,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>   		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
>>   		__set_page_locked(new_page);
>>   		SetPageSwapBacked(new_page);
>> +		blk_iotrack_set_owner(new_page, current->mm);
>
> Ditto.
>
>>   		err = __add_to_swap_cache(new_page, entry);
>>   		if (likely(!err)) {
>>   			radix_tree_preload_end();
>> --
>> 1.6.2.5
>
> Thanks,
> -Andrea
>

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-09 13:45 ` Vivek Goyal
@ 2010-07-10  0:17   ` Munehiro Ikeda
  2010-07-10  0:55     ` Nauman Rafique
  0 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-07-10  0:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir, Muuhh Ikeda

Vivek Goyal wrote, on 07/09/2010 09:45 AM:
> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>> These RFC patches are trial to add async (cached) write support on blkio
>> controller.
>>
>> Only test which has been done is to compile, boot, and that write bandwidth
>> seems prioritized when pages which were dirtied by two different processes in
>> different cgroups are written back to a device simultaneously.  I know this
>> is the minimum (or less) test but I posted this as RFC because I would like
>> to hear your opinions about the design direction in the early stage.
>>
>> Patches are for 2.6.35-rc4.
>>
>> This patch series consists of two chunks.
>>
>> (1) iotrack (patch 01/11 -- 06/11)
>>
>> This is a functionality to track who dirtied a page, in exact which cgroup a
>> process which dirtied a page belongs to.  Blkio controller will read the info
>> later and prioritize when the page is actually written to a block device.
>> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
>> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
>> proposals for IO controller.
>>
>>
>> (2) blkio controller modification (07/11 -- 11/11)
>>
>> The main part of blkio controller async write support.
>> Currently async queues are device-wide and async write IOs are always treated
>> as root group.
>> These patches make async queues per a cfq_group per a device to control them.
>> Async write is handled by flush kernel thread.  Because queue pointers are
>> stored in cfq_io_context, io_context of the thread has to have multiple
>> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
>> io_context per a cfq_group, which means per an io_context per a cgroup per a
>> device.
>>
>>
>> This might be a piece of puzzle for complete async write support of blkio
>> controller.  One of other pieces in my head is page dirtying ratio control.
>> I believe Andrea Righi was working on it...how about the situation?
>
> Thanks Muuh. I will look into the patches in detail.
>
> In my initial patches I had implemented the support for ASYNC control
> (also included Ryo's IO tracking patches) but it did not work well and
> it was unpredictable. I realized that until and unless we implement
> some kind of per group dirty ratio/page cache share at VM level and
> create parallel paths for ASYNC IO, writes often get serialized.
>
> So writes belonging to high priority group get stuck behind low priority
> group and you don't get any service differentiation.

I also faced the situation where high priority writes are stuck behind
lower priority writes.  Although this patch series seems to prioritize
IOs when they are contended, yes, contention is a bit rare because the
writes are often serialized.


> So IMHO, this piece should go into kernel after we have first fixed the
> problem at VM (read memory controller) with per cgroup dirty ratio kind
> of thing.

Well, right.  I agree.
But I think we can work in parallel.  I will try to work on both.

By the way, I guess that the write serialization is caused by the page
selection of the flush kernel thread.  If so, simple dirty ratio/page cache
share control doesn't seem able to solve that by itself.  Instead of, or in
addition to, that, the page selection order should be modified.  Am I correct?


>> And also, I'm thinking that async write support is required by bandwidth
>> capping policy of blkio controller.  Bandwidth capping can be done in upper
>> layer than elevator.
>
> I think capping facility we should implement in higher layers otherwise
> it is not useful for higher level logical devices (dm/md).
>
> It was ok to implement proportional bandwidth division at CFQ level
> because one can do proportional BW division at each leaf node and still get
> overall service differentation at higher level logical node. But same can
> not be done for max BW control.

A reason why I prefer to have BW control in the elevator is
my evaluation comparing the three IO controllers proposed
before the blkio controller was merged.  The three proposals
were dm-ioband, io-throttle, and an elevator implementation which is
the closest one to the current blkio controller.  The former two handled
BIOs and only the last one handled requests.  The result showed that
only handling requests produced the expected service differentiation.
Though I've not dived into the cause analysis, I guess the cause is that
a BIO is not associated one-to-one with an actual IO request, and possibly
the elevator behavior.
But on the other hand, as you say, a BW controller in the elevator
cannot control logical devices (or it is quite hard to adapt it to them).
It's a painful situation.

I will analyse the cause of the non-differentiation in the BIO handling
case much more deeply.


>>   However I think it should be also done in elevator layer
>> in my opinion.  Elevator buffers and sort requests.  If there is another
>> buffering functionality in upper layer, it is doubled buffering and it can be
>> harmful for elevator's prediction.
>
> I don't mind doing it at elevator layer also because in that case of
> somebody is not using dm/md, then one does not have to load max bw
> control module and one can simply enable max bw control in CFQ.
>
> Thinking more about it, now we are suggesting implementing max BW
> control at two places. I think it will be duplication of code and
> increased complexity in CFQ. Probably implement max bw control with
> the help of dm module and use same for CFQ also. There is pain
> associated with configuring dm device but I guess it is easier than
> maintaining two max bw control schemes in kernel.

Do you mean that sharing code for max BW control between dm and CFQ
is a possible solution?  It's interesting.  I will think about it.


> Thanks
> Vivek


Greatly thanks for your suggestion as always.
Muuhh


-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-10  0:17   ` Munehiro Ikeda
@ 2010-07-10  0:55     ` Nauman Rafique
  2010-07-10 13:24       ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: Nauman Rafique @ 2010-07-10  0:55 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: Vivek Goyal, linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 9, 2010 at 5:17 PM, Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> Vivek Goyal wrote, on 07/09/2010 09:45 AM:
>>
>> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>>>
>>> These RFC patches are trial to add async (cached) write support on blkio
>>> controller.
>>>
>>> Only test which has been done is to compile, boot, and that write
>>> bandwidth
>>> seems prioritized when pages which were dirtied by two different
>>> processes in
>>> different cgroups are written back to a device simultaneously.  I know
>>> this
>>> is the minimum (or less) test but I posted this as RFC because I would
>>> like
>>> to hear your opinions about the design direction in the early stage.
>>>
>>> Patches are for 2.6.35-rc4.
>>>
>>> This patch series consists of two chunks.
>>>
>>> (1) iotrack (patch 01/11 -- 06/11)
>>>
>>> This is a functionality to track who dirtied a page, in exact which
>>> cgroup a
>>> process which dirtied a page belongs to.  Blkio controller will read the
>>> info
>>> later and prioritize when the page is actually written to a block device.
>>> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and
>>> includes
>>> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one
>>> of
>>> proposals for IO controller.
>>>
>>>
>>> (2) blkio controller modification (07/11 -- 11/11)
>>>
>>> The main part of blkio controller async write support.
>>> Currently async queues are device-wide and async write IOs are always
>>> treated
>>> as root group.
>>> These patches make async queues per a cfq_group per a device to control
>>> them.
>>> Async write is handled by flush kernel thread.  Because queue pointers
>>> are
>>> stored in cfq_io_context, io_context of the thread has to have multiple
>>> cfq_io_contexts per a device.  So these patches make cfq_io_context per
>>> an
>>> io_context per a cfq_group, which means per an io_context per a cgroup
>>> per a
>>> device.
>>>
>>>
>>> This might be a piece of puzzle for complete async write support of blkio
>>> controller.  One of other pieces in my head is page dirtying ratio
>>> control.
>>> I believe Andrea Righi was working on it...how about the situation?
>>
>> Thanks Muuh. I will look into the patches in detail.
>>
>> In my initial patches I had implemented the support for ASYNC control
>> (also included Ryo's IO tracking patches) but it did not work well and
>> it was unpredictable. I realized that until and unless we implement
>> some kind of per group dirty ratio/page cache share at VM level and
>> create parallel paths for ASYNC IO, writes often get serialized.
>>
>> So writes belonging to high priority group get stuck behind low priority
>> group and you don't get any service differentiation.
>
> I also faced the situation that high priority writes are behind
> lower priority writes.  Although this patch seems to prioritize
> IOs if these IOs are contended, yes, it is rare a bit because they
> are serialized often.
>
>
>> So IMHO, this piece should go into kernel after we have first fixed the
>> problem at VM (read memory controller) with per cgroup dirty ratio kind
>> of thing.
>
> Well, right.  I agree.
> But I think we can work parallel.  I will try to struggle on both.

IMHO, we have a classic chicken and egg problem here. We should try to
merge pieces as they become available. If we can agree on patches
that do async IO tracking for the IO controller, we should go ahead with
them instead of trying to wait for per-cgroup dirty ratios.

In terms of getting numbers, we have been using patches that add per
cpuset dirty ratios on top of NUMA_EMU, and we get good
differentiation between buffered writes as well as buffered writes vs.
reads.

It is really obvious that as long as flusher threads, etc. are not
cgroup aware, differentiation for buffered writes will not be perfect
in all cases, but this is a step in the right direction and we should
go for it.

>
> By the way, I guess that write serialization is caused by page selection
> of flush kernel thread.  If so, simple dirty ratio/page cache share
> controlling don't seem to be able to solve that for me.  Instead or in
> addition to it, page selection order should be modified.  Am I correct?
>
>
>>> And also, I'm thinking that async write support is required by bandwidth
>>> capping policy of blkio controller.  Bandwidth capping can be done in
>>> upper
>>> layer than elevator.
>>
>> I think capping facility we should implement in higher layers otherwise
>> it is not useful for higher level logical devices (dm/md).
>>
>> It was ok to implement proportional bandwidth division at CFQ level
>> because one can do proportional BW division at each leaf node and still
>> get
>> overall service differentation at higher level logical node. But same can
>> not be done for max BW control.
>
> A reason why I prefer to have BW control in elevator is
> based on my evaluation result of three proposed IO controller
> comparison before blkio controller was merged.  Three proposals
> were dm-ioband, io-throttle, and elevator implementation which is
> the closest one to current blkio controller.  Former two handled
> BIOs and only last one handled REQUESTs.  The result shows that
> only handling REQUESTs can produce expected service differentiation.
> Though I've not dived into the cause analysis, I guess that BIO
> is not associated with actual IO request one by one and elevator
> behavior are possibly the cause.
> But on the other hand, as you say, BW controller in elevator
> cannot control logical devices (or quite hard to adapt to them).
> It's painful situation.
>
> I will analyse the cause of non-differentiation in BIO handling
> case much deeper.
>
>
>>>  However I think it should be also done in elevator layer
>>> in my opinion.  Elevator buffers and sort requests.  If there is another
>>> buffering functionality in upper layer, it is doubled buffering and it
>>> can be
>>> harmful for elevator's prediction.
>>
>> I don't mind doing it at elevator layer also because in that case of
>> somebody is not using dm/md, then one does not have to load max bw
>> control module and one can simply enable max bw control in CFQ.
>>
>> Thinking more about it, now we are suggesting implementing max BW
>> control at two places. I think it will be duplication of code and
>> increased complexity in CFQ. Probably implement max bw control with
>> the help of dm module and use same for CFQ also. There is pain
>> associated with configuring dm device but I guess it is easier than
>> maintaining two max bw control schemes in kernel.
>
> Do you mean that sharing code for max BW control between dm and CFQ
> is a possible solution?  It's interesting.  I will think about it.
>
>
>> Thanks
>> Vivek
>
>
> Greatly thanks for your suggestion as always.
> Muuhh
>
>
> --
> IKEDA, Munehiro
>  NEC Corporation of America
>    m-ikeda@ds.jp.nec.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09 23:09     ` Munehiro Ikeda
@ 2010-07-10 10:06       ` Andrea Righi
  0 siblings, 0 replies; 53+ messages in thread
From: Andrea Righi @ 2010-07-10 10:06 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Vivek Goyal, Ryo Tsuruta, taka,
	Gui Jianfeng, akpm, balbir

On Fri, Jul 09, 2010 at 07:09:31PM -0400, Munehiro Ikeda wrote:
> KAMEZAWA Hiroyuki wrote, on 07/09/2010 03:38 AM:
> >On Thu, 08 Jul 2010 23:15:30 -0400
> >Munehiro Ikeda<m-ikeda@ds.jp.nec.com>  wrote:
> >
> >>Signed-off-by: Hirokazu Takahashi<taka@valinux.co.jp>
> >>Signed-off-by: Ryo Tsuruta<ryov@valinux.co.jp>
> >>Signed-off-by: Andrea Righi<arighi@develer.com>
> >>Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
> >>---
> >>  block/Kconfig.iosched       |    8 +++
> >>  block/Makefile              |    1 +
> >>  block/blk-iotrack.c         |  129 +++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/blk-iotrack.h |   62 +++++++++++++++++++++
> >>  include/linux/page_cgroup.h |   25 ++++++++
> >>  init/Kconfig                |    2 +-
> >>  mm/page_cgroup.c            |   91 +++++++++++++++++++++++++++---
> >>  7 files changed, 309 insertions(+), 9 deletions(-)
> >>  create mode 100644 block/blk-iotrack.c
> >>  create mode 100644 include/linux/blk-iotrack.h
> 
> (snip)
> 
> >>+/**
> >>+ * blk_iotrack_set_owner() - set the owner ID of a page.
> >>+ * @page:	the page we want to tag
> >>+ * @mm:		the mm_struct of a page owner
> >>+ *
> >>+ * Make a given page have the blkio-cgroup ID of the owner of this page.
> >>+ */
> >>+int blk_iotrack_set_owner(struct page *page, struct mm_struct *mm)
> >>+{
> >>+	struct blkio_cgroup *blkcg;
> >>+	unsigned short id = 0;	/* 0: default blkio_cgroup id */
> >>+
> >>+	if (blk_iotrack_disabled())
> >>+		return 0;
> >>+	if (!mm)
> >>+		goto out;
> >>+
> >>+	rcu_read_lock();
> >>+	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
> >>+	if (likely(blkcg))
> >>+		id = css_id(&blkcg->css);
> >>+	rcu_read_unlock();
> >>+out:
> >>+	return page_cgroup_set_owner(page, id);
> >>+}
> >>+
> >
> >I think this is bad. I/O should be charged against threads i.e. "current",
> >because CFQ/blockio-cgroup provides per-thread control.
> >mm->owner is not suitable, I think.
> 
> OK, thanks.

BTW, this should be automatically fixed if you remove anonymous page
tracking.

-Andrea

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-10  0:55     ` Nauman Rafique
@ 2010-07-10 13:24       ` Vivek Goyal
  2010-07-12  0:20         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-10 13:24 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 09, 2010 at 05:55:23PM -0700, Nauman Rafique wrote:

[..]
> > Well, right.  I agree.
> > But I think we can work parallel.  I will try to struggle on both.
> 
> IMHO, we have a classic chicken and egg problem here. We should try to
> merge pieces as they become available. If we get to agree on patches
> that do async IO tracking for IO controller, we should go ahead with
> them instead of trying to wait for per cgroup dirty ratios.
> 
> In terms of getting numbers, we have been using patches that add per
> cpuset dirty ratios on top of NUMA_EMU, and we get good
> differentiation between buffered writes as well as buffered writes vs.
> reads.
> 
> It is really obvious that as long as flusher threads ,etc are not
> cgroup aware, differentiation for buffered writes would not be perfect
> in all cases, but this is a step in the right direction and we should
> go for it.

Working in parallel on two separate pieces is fine. But pushing the second
piece in first does not make much sense to me, because the second piece does
not work if the first piece is not in. There is no way to test it. What's the
point of pushing code into the kernel which only compiles but does not achieve
its intended purpose because some other pieces are missing?

Per-cgroup dirty ratio is a somewhat hard problem and a few attempts have
already been made at it. IMHO, we need to first work on that piece and
get it inside the kernel, and then work on the IO tracking patches. Let's
fix the hard problem first that is necessary to make the second set of
patches work.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-09 23:06     ` Munehiro Ikeda
@ 2010-07-12  0:11       ` KAMEZAWA Hiroyuki
  2010-07-14 14:46         ` Munehiro IKEDA
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-12  0:11 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, Andrea Righi,
	Gui Jianfeng, akpm, balbir

On Fri, 09 Jul 2010 19:06:28 -0400
Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:

> OK, we can do it like:
> 
> do {
> 	old =  pc->flags;
> 	new = old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1);
> 	new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
> } while (cmpxchg(&pc->flags, old, new)  != old);
> 
> 
> > IIUC, there is an generic version now even if it's heavy.
> 
> I couldn't find it out...is there already?  Or you mean we should
> have generic one?
> 

There is a generic cmpxchg in /include/asm-generic/cmpxchg.h.
But...ahh...on some arches you can't use cmpxchg, sorry.

How about starting from "a new field" for the I/O cgroup in page_cgroup?

struct page_cgroup {
        unsigned long flags;
        struct mem_cgroup *mem_cgroup;
	unsigned short blkio_cgroup_id;
        struct page *page;
        struct list_head lru;           /* per cgroup LRU list */
};

We can consider how to optimize it out later.
(And it's easier to debug and makes development smoother.)

For example, if we can create a vmalloc'ed array of mem_cgroup pointers,
the id->mem_cgroup lookup can be very fast and we can replace the
pc->mem_cgroup link with a pc->mem_cgroup_id.
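
A rough sketch of that idea (illustration only; the array name, its size,
and the accessor are assumptions, not code from any posted patch):

/* id -> mem_cgroup lookup table, vmalloc'ed once at init, e.g.:
 *	mem_cgroup_by_id = vmalloc(MEM_CGROUP_MAX_IDS *
 *				   sizeof(*mem_cgroup_by_id));
 */
#define MEM_CGROUP_MAX_IDS	(1 << 16)	/* fits an unsigned short id */

static struct mem_cgroup **mem_cgroup_by_id;

static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
{
	/* id 0 reserved to mean "no cgroup" */
	return id ? mem_cgroup_by_id[id] : NULL;
}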


> 
> >> +unsigned long page_cgroup_get_owner(struct page *page);
> >> +int page_cgroup_set_owner(struct page *page, unsigned long id);
> >> +int page_cgroup_copy_owner(struct page *npage, struct page *opage);
> >> +
> >>   void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> >>
> >>   #ifdef CONFIG_SPARSEMEM
> >> diff --git a/init/Kconfig b/init/Kconfig
> >> index 2e40f2f..337ee01 100644
> >> --- a/init/Kconfig
> >> +++ b/init/Kconfig
> >> @@ -650,7 +650,7 @@ endif # CGROUPS
> >>
> >>   config CGROUP_PAGE
> >>   	def_bool y
> >> -	depends on CGROUP_MEM_RES_CTLR
> >> +	depends on CGROUP_MEM_RES_CTLR || GROUP_IOSCHED_ASYNC
> >>
> >>   config MM_OWNER
> >>   	bool
> >> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> >> index 6c00814..69e080c 100644
> >> --- a/mm/page_cgroup.c
> >> +++ b/mm/page_cgroup.c
> >> @@ -9,6 +9,7 @@
> >>   #include<linux/vmalloc.h>
> >>   #include<linux/cgroup.h>
> >>   #include<linux/swapops.h>
> >> +#include<linux/blk-iotrack.h>
> >>
> >>   static void __meminit
> >>   __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> >> @@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
> >>
> >>   	int nid, fail;
> >>
> >> -	if (mem_cgroup_disabled())
> >> +	if (mem_cgroup_disabled()&&  blk_iotrack_disabled())
> >>   		return;
> >>
> >>   	for_each_online_node(nid)  {
> >> @@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
> >>   			goto fail;
> >>   	}
> >>   	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> >> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
> >> -	" don't want memory cgroups\n");
> >> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
> >> +	" if you don't want memory and blkio cgroups\n");
> >>   	return;
> >>   fail:
> >>   	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
> >> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
> >> +	printk(KERN_CRIT
> >> +		"please try 'cgroup_disable=memory,blkio' boot option\n");
> >>   	panic("Out of memory");
> >>   }
> >
> > Hmm, io-track is always done if blkio-cgroup is used, Right ?
> 
> No, iotrack is enabled only when CONFIG_GROUP_IOSCHED_ASYNC=y.
> If =n, iotrack is disabled even if blkio-cgroup is enabled.
> 


> 
> > Then, why you have extra config as CONFIG_GROUP_IOSCHED_ASYNC ?
> > Is it necessary ?
> 
> Current purpose of the option is only for debug because it is
> experimental functionality.
> It can be removed if this work is completed, or dynamic switch
> might be useful.
> 
> In fact, just "cgroup_disable=memory" is enough for the failure
> case.  Let me think right messages.
> 

IMHO, once you add a boot option or sysctl, it's very hard to remove it later.
So, if you think you'll remove it later, don't add it, or just add a CONFIG option.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-10 13:24       ` Vivek Goyal
@ 2010-07-12  0:20         ` KAMEZAWA Hiroyuki
  2010-07-12 13:18           ` Vivek Goyal
  2010-07-22 19:28           ` Greg Thelen
  0 siblings, 2 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-12  0:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Sat, 10 Jul 2010 09:24:17 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Fri, Jul 09, 2010 at 05:55:23PM -0700, Nauman Rafique wrote:
> 
> [..]
> > > Well, right.  I agree.
> > > But I think we can work parallel.  I will try to struggle on both.
> > 
> > IMHO, we have a classic chicken and egg problem here. We should try to
> > merge pieces as they become available. If we get to agree on patches
> > that do async IO tracking for IO controller, we should go ahead with
> > them instead of trying to wait for per cgroup dirty ratios.
> > 
> > In terms of getting numbers, we have been using patches that add per
> > cpuset dirty ratios on top of NUMA_EMU, and we get good
> > differentiation between buffered writes as well as buffered writes vs.
> > reads.
> > 
> > It is really obvious that as long as flusher threads ,etc are not
> > cgroup aware, differentiation for buffered writes would not be perfect
> > in all cases, but this is a step in the right direction and we should
> > go for it.
> 
> Working parallel on two separate pieces is fine. But pushing second piece
> in first does not make much sense to me because second piece does not work
> if first piece is not in. There is no way to test it. What's the point of
> pushing a code in kernel which only compiles but does not achieve intented
> purposes because some other pieces are missing.
> 
> Per cgroup dirty ratio is a little hard problem and few attempts have
> already been made at it. IMHO, we need to first work on that piece and
> get it inside the kernel and then work on IO tracking patches. Lets
> fix the hard problem first that is necessary to make second set of patches
> work.
> 

I've just been waiting for the dirty-ratio patches because I know someone is
working on them.  But, hmm, I'll consider starting the work myself.

(Off-topic)
BTW, why is the io-cgroup's hierarchy level limited to 2?
Because of that limitation, libvirt can't work well...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-12  0:20         ` KAMEZAWA Hiroyuki
@ 2010-07-12 13:18           ` Vivek Goyal
  2010-07-13  4:36             ` KAMEZAWA Hiroyuki
  2010-07-22 19:28           ` Greg Thelen
  1 sibling, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-12 13:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Mon, Jul 12, 2010 at 09:20:04AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 10 Jul 2010 09:24:17 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Fri, Jul 09, 2010 at 05:55:23PM -0700, Nauman Rafique wrote:
> > 
> > [..]
> > > > Well, right.  I agree.
> > > > But I think we can work parallel.  I will try to struggle on both.
> > > 
> > > IMHO, we have a classic chicken and egg problem here. We should try to
> > > merge pieces as they become available. If we get to agree on patches
> > > that do async IO tracking for IO controller, we should go ahead with
> > > them instead of trying to wait for per cgroup dirty ratios.
> > > 
> > > In terms of getting numbers, we have been using patches that add per
> > > cpuset dirty ratios on top of NUMA_EMU, and we get good
> > > differentiation between buffered writes as well as buffered writes vs.
> > > reads.
> > > 
> > > It is really obvious that as long as flusher threads ,etc are not
> > > cgroup aware, differentiation for buffered writes would not be perfect
> > > in all cases, but this is a step in the right direction and we should
> > > go for it.
> > 
> > Working parallel on two separate pieces is fine. But pushing second piece
> > in first does not make much sense to me because second piece does not work
> > if first piece is not in. There is no way to test it. What's the point of
> > pushing a code in kernel which only compiles but does not achieve intented
> > purposes because some other pieces are missing.
> > 
> > Per cgroup dirty ratio is a little hard problem and few attempts have
> > already been made at it. IMHO, we need to first work on that piece and
> > get it inside the kernel and then work on IO tracking patches. Lets
> > fix the hard problem first that is necessary to make second set of patches
> > work.
> > 
> 
> I've just waited for dirty-ratio patches because I know someone is working on.
> But, hmm, I'll consider to start work by myself.
> 

If you can spare time to get it going, it would be great.

> (Off-topic)
> BTW, why io-cgroup's hierarchy level is limited to 2 ?
> Because of that limitation, libvirt can't work well...

Because the current CFQ code is not written to support hierarchy. So it was
better not to allow creation of groups inside of groups, to avoid surprises.

We need to figure out something for libvirt. One of the options would be
that libvirt allows blkio group creation in /root. Or one will have to
look into hierarchical support in CFQ.

Things get a little complicated in CFQ once we want to support hierarchy.
And to begin with, I am not expecting many people to really create groups
inside groups. That's why I am currently focusing on making sure that the
current infrastructure works well instead of just adding more features to
it.

A few things I am looking into:

- CFQ performance is not good on high-end storage, so group control also
  suffers from the same issue. I am trying to introduce a group_idle tunable
  to solve some of the problems.

- Even after group_idle, overall throughput suffers if groups don't have
  enough traffic to keep the array busy. I am trying to create a mode where
  a user can choose to let fairness go if groups don't have enough traffic
  to keep the array busy.

- Request descriptors are still per queue and not per group. I noticed that
  the moment we create more groups, we start running into the issue of not
  enough request descriptors, and it starts introducing serialization among
  groups. We need to have per-group request descriptor infrastructure in.

First I am planning to sort out the above issues and then look into other
enhancements.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-12 13:18           ` Vivek Goyal
@ 2010-07-13  4:36             ` KAMEZAWA Hiroyuki
  2010-07-14 14:29               ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-13  4:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Mon, 12 Jul 2010 09:18:05 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> > I've just waited for dirty-ratio patches because I know someone is working on.
> > But, hmm, I'll consider to start work by myself.
> > 
> 
> If you can spare time to get it going, it would be great.
> 
> > (Off-topic)
> > BTW, why io-cgroup's hierarchy level is limited to 2 ?
> > Because of that limitation, libvirt can't work well...
> 
> Because current CFQ code is not written to support hierarchy. So it was
> better to not allow creation of groups inside of groups to avoid suprises.
> 
> We need to figure out something for libvirt. One of the options would be
> that libvirt allows blkio group creation in /root. Or one shall have to
> look into hierarchical support in CFQ.
> 

Hmm, can't we start from a hierarchy which doesn't support inheritance?
IOW, the blkio cgroup has child directories but all cgroups are treated as
flat. In the future, true hierarchy support may be added and you may be able
to use it via a mount option....
For example, the memory cgroup's hierarchy support is optional...because it's slow.

The cgroup feature of mounting several subsystems at one mount point at once
is very useful in many cases.

Thanks
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-13  4:36             ` KAMEZAWA Hiroyuki
@ 2010-07-14 14:29               ` Vivek Goyal
  2010-07-15  0:00                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-14 14:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Tue, Jul 13, 2010 at 01:36:36PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 12 Jul 2010 09:18:05 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > > I've just waited for dirty-ratio patches because I know someone is working on.
> > > But, hmm, I'll consider to start work by myself.
> > > 
> > 
> > If you can spare time to get it going, it would be great.
> > 
> > > (Off-topic)
> > > BTW, why io-cgroup's hierarchy level is limited to 2 ?
> > > Because of that limitation, libvirt can't work well...
> > 
> > Because current CFQ code is not written to support hierarchy. So it was
> > better to not allow creation of groups inside of groups to avoid suprises.
> > 
> > We need to figure out something for libvirt. One of the options would be
> > that libvirt allows blkio group creation in /root. Or one shall have to
> > look into hierarchical support in CFQ.
> > 
> 
> Hmm, can't we start from a hierarchy which doesn't support inheritance ?
> IOW, blkio cgroup has children directories but all cgroups are treated as
> flat. In future, true hierarchy support may be added and you may able to
> use it via mount option....
> For example, memory cgroup's hierarchy support is optional..because it's slow.

I think doing that is even more confusing to the user, where the cgroup dir
structure shows a hierarchy of groups but in practice that's not the case. It
is easier to deny creating child groups within groups right away and
let user space mount blkio at a different mount point and plan the
resource usage accordingly.

> 
> Cgroup's feature as mounting several subsystems at a mount point at once
> is very useful in many case.

I agree that it is useful but if some controllers are not supporting
hierarchy, it just adds to more confusion. And later when hierarchy
support comes in, there will be additional issue of keeping this file
"use_hierarchy" like memory controller.

So at this point of time, I am not too inclined towards allowing hierarchical
cgroup creation but treating them as flat in CFQ. I think it adds to the
confusion and user space should handle this situation.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 02/11] blkiocg async: The main part of iotrack
  2010-07-12  0:11       ` KAMEZAWA Hiroyuki
@ 2010-07-14 14:46         ` Munehiro IKEDA
  0 siblings, 0 replies; 53+ messages in thread
From: Munehiro IKEDA @ 2010-07-14 14:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, Andrea Righi,
	Gui Jianfeng, akpm, balbir

KAMEZAWA Hiroyuki wrote: 
> On Fri, 09 Jul 2010 19:06:28 -0400
> Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> 
>> OK, we can do it like:
>>
>> do {
>> 	old =  pc->flags;
>> 	new = old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1);
>> 	new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
>> } while (cmpxchg(&pc->flags, old, new)  != old);
>>
>>
>>> IIUC, there is a generic version now even if it's heavy.
>> I couldn't find it...is there one already?  Or do you mean we should
>> have a generic one?
>>
> 
> generic cmpxchg in /include/asm-generic/cmpxchg.h
> But...ahh...on some architectures, you can't use cmpxchg, sorry.
> 
> How about starting from "a new field" for the I/O cgroup in page_cgroup?
> 
> struct page_cgroup {
>         unsigned long flags;
>         struct mem_cgroup *mem_cgroup;
> 	unsigned short blkio_cgroup_id;
>         struct page *page;
>         struct list_head lru;           /* per cgroup LRU list */
> };
> 
> We can consider how we optimize it out, later.
> (And, it's easier to debug and make development smooth.)
> 
> For example, if we can create a vmalloced array of mem_cgroup,
> id->mem_cgroup lookup can be very fast and we can replace
> pc->mem_cgroup link with pc->mem_cgroup_id.

Nice suggestion.  Thanks Kame-san.
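
[Editor's note: the following is a minimal user-space sketch of the approach agreed above, i.e. a dedicated blkio_cgroup_id field in page_cgroup plus an id-to-group lookup table, instead of packing the id into pc->flags with cmpxchg.  All names, sizes and signatures here are illustrative assumptions; in particular the posted series' page_cgroup_{get,set}_owner() operate on a struct page and an unsigned long id, not on the simplified types used below.]

#include <stdio.h>

#define MAX_BLKIO_CGROUPS 4096

struct blkio_cgroup {
    unsigned short id;
    const char *name;
};

/* In the kernel this could be a vmalloc'ed array indexed by id, so that
 * the id -> group lookup is a single dereference. */
static struct blkio_cgroup *blkio_cgroup_table[MAX_BLKIO_CGROUPS];

/* Simplified page_cgroup: a dedicated id field instead of bits packed
 * into pc->flags, so no cmpxchg loop is needed. */
struct page_cgroup {
    unsigned long flags;
    unsigned short blkio_cgroup_id;
};

static void set_owner(struct page_cgroup *pc, const struct blkio_cgroup *blkcg)
{
    pc->blkio_cgroup_id = blkcg->id;
}

static struct blkio_cgroup *get_owner(const struct page_cgroup *pc)
{
    return blkio_cgroup_table[pc->blkio_cgroup_id];
}

int main(void)
{
    struct blkio_cgroup grp = { .id = 7, .name = "grp_a" };
    struct page_cgroup pc = { 0 };

    blkio_cgroup_table[grp.id] = &grp;
    set_owner(&pc, &grp);
    printf("page was dirtied by group %s\n", get_owner(&pc)->name);
    return 0;
}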


>>>> +unsigned long page_cgroup_get_owner(struct page *page);
>>>> +int page_cgroup_set_owner(struct page *page, unsigned long id);
>>>> +int page_cgroup_copy_owner(struct page *npage, struct page *opage);
>>>> +
>>>>   void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
>>>>
>>>>   #ifdef CONFIG_SPARSEMEM
>>>> diff --git a/init/Kconfig b/init/Kconfig
>>>> index 2e40f2f..337ee01 100644
>>>> --- a/init/Kconfig
>>>> +++ b/init/Kconfig
>>>> @@ -650,7 +650,7 @@ endif # CGROUPS
>>>>
>>>>   config CGROUP_PAGE
>>>>   	def_bool y
>>>> -	depends on CGROUP_MEM_RES_CTLR
>>>> +	depends on CGROUP_MEM_RES_CTLR || GROUP_IOSCHED_ASYNC
>>>>
>>>>   config MM_OWNER
>>>>   	bool
>>>> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
>>>> index 6c00814..69e080c 100644
>>>> --- a/mm/page_cgroup.c
>>>> +++ b/mm/page_cgroup.c
>>>> @@ -9,6 +9,7 @@
>>>>   #include <linux/vmalloc.h>
>>>>   #include <linux/cgroup.h>
>>>>   #include <linux/swapops.h>
>>>> +#include <linux/blk-iotrack.h>
>>>>
>>>>   static void __meminit
>>>>   __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
>>>> @@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
>>>>
>>>>   	int nid, fail;
>>>>
>>>> -	if (mem_cgroup_disabled())
>>>> +	if (mem_cgroup_disabled() && blk_iotrack_disabled())
>>>>   		return;
>>>>
>>>>   	for_each_online_node(nid)  {
>>>> @@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
>>>>   			goto fail;
>>>>   	}
>>>>   	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
>>>> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
>>>> -	" don't want memory cgroups\n");
>>>> +	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
>>>> +	" if you don't want memory and blkio cgroups\n");
>>>>   	return;
>>>>   fail:
>>>>   	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
>>>> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
>>>> +	printk(KERN_CRIT
>>>> +		"please try 'cgroup_disable=memory,blkio' boot option\n");
>>>>   	panic("Out of memory");
>>>>   }
>>> Hmm, io-track is always done if blkio-cgroup is used, Right ?
>> No, iotrack is enabled only when CONFIG_GROUP_IOSCHED_ASYNC=y.
>> If =n, iotrack is disabled even if blkio-cgroup is enabled.
>>
> 
> 
>>> Then, why you have extra config as CONFIG_GROUP_IOSCHED_ASYNC ?
>>> Is it necessary ?
>> The current purpose of the option is only for debugging, because it is
>> experimental functionality.
>> It can be removed when this work is completed, or a dynamic switch
>> might be useful.
>>
>> In fact, just "cgroup_disable=memory" is enough for the failure
>> case.  Let me think about the right messages.
>>
> 
> IMHO, once you add a boot option or sysctl, it's very hard to remove it later.
> So, if you think you'll remove it later, don't add it, or just add a CONFIG option.

OK.  I understand we need to think it over seriously before adding a new boot option.


Thanks,
Muuhh


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-14 14:29               ` Vivek Goyal
@ 2010-07-15  0:00                 ` KAMEZAWA Hiroyuki
  2010-07-16 13:43                   ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-15  0:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Wed, 14 Jul 2010 10:29:19 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > Cgroup's feature as mounting several subsystems at a mount point at once
> > is very useful in many case.
> 
> I agree that it is useful but if some controllers are not supporting
> hierarchy, it just adds to more confusion. And later when hierarchy
> support comes in, there will be additional issue of keeping this file
> "use_hierarchy" like memory controller.
> 
> So at this point of time , I am not too inclined towards allowing hierarchical
> cgroup creation but treating them as flat in CFQ. I think it adds to the
> confusion and user space should handle this situation.
> 

Hmm. 

Could you fix error code in create blkio cgroup ? It returns -EINVAL now.
IIUC, mkdir(2) doesn't return -EINVAL as error code (from man.)
Then, it's very confusing. I think -EPERM or -ENOMEM will be much better.

Anyway, I need to see source code of blk-cgroup.c to know why libvirt fails
to create cgroup. Where is the user-visible information (in RHEL or Fedora)
about "you can't use blkio-cgroup via libvirt or libcgroup" ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-15  0:00                 ` KAMEZAWA Hiroyuki
@ 2010-07-16 13:43                   ` Vivek Goyal
  2010-07-16 14:15                     ` Daniel P. Berrange
  0 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-16 13:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nauman Rafique, Munehiro Ikeda, linux-kernel, Ryo Tsuruta, taka,
	Andrea Righi, Gui Jianfeng, akpm, balbir, Daniel P. Berrange

On Thu, Jul 15, 2010 at 09:00:48AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 14 Jul 2010 10:29:19 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > > 
> > > Cgroup's feature as mounting several subsystems at a mount point at once
> > > is very useful in many case.
> > 
> > I agree that it is useful but if some controllers are not supporting
> > hierarchy, it just adds to more confusion. And later when hierarchy
> > support comes in, there will be additional issue of keeping this file
> > "use_hierarchy" like memory controller.
> > 
> > So at this point of time , I am not too inclined towards allowing hierarchical
> > cgroup creation but treating them as flat in CFQ. I think it adds to the
> > confusion and user space should handle this situation.
> > 
> 
> Hmm. 
> 
> Could you fix error code in create blkio cgroup ? It returns -EINVAL now.
> IIUC, mkdir(2) doesn't return -EINVAL as error code (from man.)
> Then, it's very confusing. I think -EPERM or -ENOMEM will be much better.

Hm..., probably -EPERM is somewhat close to what we are doing. The file system
does support creation of directories but not after a certain level.

I will trace more instances of mkdir error values.
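
[Editor's note: a standalone user-space model of the restriction under discussion: blkio group creation is refused below the first level, and the error returned is -EPERM rather than -EINVAL.  The depth check and the function name are illustrative assumptions, not the actual blk-cgroup.c code.]

#include <errno.h>
#include <stdio.h>

struct cgroup_model {
    struct cgroup_model *parent;
};

/* Allow only the root group and its direct children, mirroring the
 * "CFQ treats groups as flat" limitation discussed in this thread. */
static int blkio_can_create(const struct cgroup_model *cg)
{
    if (cg->parent && cg->parent->parent)
        return -EPERM;  /* instead of -EINVAL, as suggested above */
    return 0;
}

int main(void)
{
    struct cgroup_model root = { NULL };
    struct cgroup_model child = { &root };
    struct cgroup_model grandchild = { &child };

    printf("create child: %d, create grandchild: %d\n",
           blkio_can_create(&child), blkio_can_create(&grandchild));
    return 0;
}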

> 
> Anyway, I need to see source code of blk-cgroup.c to know why libvirt fails
> to create cgroup.

[CCing daniel berrange]

AFAIK, libvirt does not have support for blkio controller yet. Are you 
trying to introduce that? 

libvirt creates a directory tree. I think /cgroup/libvirt/qemu/kvm-dirs.
So actual virtual machine directories are 2-3 levels below, and that would
explain why, if you try to use the blkio controller with libvirt, it will fail:
it will not be able to create directories at that level.

I think libvirt needs to special-case blkio here to create directories at the
top level. It is odd, but really there are no easy answers. Or will we not
support a controller in libvirt till the controller supports hierarchy?

> Where is the user-visible information (in RHEL or Fedora)
> about "you can't use blkio-cgroup via libvirt or libcgroup" ?

[CCing balbir]

I think with libcgroup you can use the blkio controller. I know somebody
who was using the cgexec command to launch some jobs in blkio cgroups. AFAIK,
libcgroup does not have much controller-specific state and should
not require any modifications for the blkio controller.

Balbir can tell us more.

libvirt will require modification to support blkio controller. I also 
noticed that libvirt by default puts every virtual machine into its
own cgroup. I think it might not be a very good strategy for blkio
controller because putting every virtual machine in its own cgroup
will kill overall throughput if each virtual machine is not driving
enough IO.

I am also trying to come up with some additional logic for letting go of
fairness if a group is not doing sufficient IO.

Daniel, do you know where the documentation is that says which controllers
are currently supported by libvirt?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-16 13:43                   ` Vivek Goyal
@ 2010-07-16 14:15                     ` Daniel P. Berrange
  2010-07-16 14:35                       ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: Daniel P. Berrange @ 2010-07-16 14:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 16, 2010 at 09:43:53AM -0400, Vivek Goyal wrote:
> On Thu, Jul 15, 2010 at 09:00:48AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 14 Jul 2010 10:29:19 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > 
> > > > Cgroup's feature as mounting several subsystems at a mount point at once
> > > > is very useful in many case.
> > > 
> > > I agree that it is useful but if some controllers are not supporting
> > > hierarchy, it just adds to more confusion. And later when hierarchy
> > > support comes in, there will be additional issue of keeping this file
> > > "use_hierarchy" like memory controller.
> > > 
> > > So at this point of time , I am not too inclined towards allowing hierarchical
> > > cgroup creation but treating them as flat in CFQ. I think it adds to the
> > > confusion and user space should handle this situation.
> > > 
> > 
> > Hmm. 
> > 
> > Could you fix error code in create blkio cgroup ? It returns -EINVAL now.
> > IIUC, mkdir(2) doesn't return -EINVAL as error code (from man.)
> > Then, it's very confusing. I think -EPERM or -ENOMEM will be much better.
> 
> Hm..., probably -EPERM is somewhat close to what we are doing. The file system
> does support creation of directories but not after a certain level.
> 
> I will trace more instances of mkdir error values.
> 
> > 
> > Anyway, I need to see source code of blk-cgroup.c to know why libvirt fails
> > to create cgroup.
> 
> [CCing daniel berrange]
> 
> AFAIK, libvirt does not have support for blkio controller yet. Are you 
> trying to introduce that? 
> 
> libvirt creates a directory tree. I think /cgroup/libvirt/qemu/kvm-dirs.
> So actual virtual machine directories are 2-3 levels below, and that would
> explain why, if you try to use the blkio controller with libvirt, it will fail:
> it will not be able to create directories at that level.

Yes, we use a hierarchy to deal with namespace uniqueness. The
first step is to determine where libvirtd process is placed. This
may be the root cgroup, but it may already be one or more levels
down due to the init system (sysv-init, upstart, systemd etc)
startup policy. Once that's determined we create a 'libvirt' 
cgroup which acts as a container for everything run by libvirtd.
At the next level is the driver name (qemu, lxc, uml). This allows
confinement of all guests for a particular driver and gives us
a unique namespace for the next level where we have a directory
per guest. This last level is where libvirt actually sets tunables
normally. The higher levels are for administrator use.

  $ROOT  (where libvirtd process is, not the root mount point)
   |
   +- libvirt
       |
       +- qemu
       |   |
       |   +- guest1
       |   +- guest2
       |   +- guest3
       |   ...
       |
       +- lxc
           +- guest1
           +- guest2
           +- guest3
           ...


> I think libvirt needs to special-case blkio here to create directories at the
> top level. It is odd, but really there are no easy answers. Or will we not
> support a controller in libvirt till the controller supports hierarchy?

We explicitly avoided creating anything at the top level. We always
detect where the libvirtd process has been placed & only ever create
stuff below that point. This ensures the host admin can set overall
limits for virt on a host, and not have libvirt side-step these limits
by jumping back up to the root cgroup.

> > Where is the user-visible information (in RHEL or Fedora)
> > about "you can't use blkio-cgroup via libvirt or libcgroup" ?
> 
> [CCing balbir]
> 
> I think with libcgroup you can use blkio controller. I know somebody
> who was using cgexec command to launch some jobs in blkio cgroups. AFAIK,
> libcgroup does not have too much controller specific state and should
> not require any modifications for blkio controller. 
> 
> Balbir can tell us more.
> 
> libvirt will require modification to support blkio controller. I also 
> noticed that libvirt by default puts every virtual machine into its
> own cgroup. I think it might not be a very good strategy for blkio
> controller because putting every virtual machine in its own cgroup
> will kill overall throughput if each virtual machine is not driving
> enough IO.

A requirement to do everything in the top level and not use a hierarchy
for blkio makes this a pretty unfriendly controller to use. It seriously
limits the flexibility of what libvirt and host administrators can do and
means we can't effectively split policy between them. It also means
that if the blkio controller were ever mounted at the same point as another
controller, you'd lose the hierarchy support for that other controller.
IMHO use of the cgroups hierarchy is key to making cgroups manageable for
applications. We can't have many different applications on a system
all having to create many directories at the top level.

> I am also trying to come up with some additional logic of letting go 
> fairness if a group is not doing sufficient IO.
> 
> Daniel, do you know where is the documentation which says what controllers
> are currently supported by libvirt.

We use cpu, cpuacct, cpuset, memory, devices & freezer currently. 

Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-16 14:15                     ` Daniel P. Berrange
@ 2010-07-16 14:35                       ` Vivek Goyal
  2010-07-16 14:53                         ` Daniel P. Berrange
  0 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-16 14:35 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:

[..]
> > libvirt will require modification to support blkio controller. I also 
> > noticed that libvirt by default puts every virtual machine into its
> > own cgroup. I think it might not be a very good strategy for blkio
> > controller because putting every virtual machine in its own cgroup
> > will kill overall throughput if each virtual machine is not driving
> > enough IO.
> 
> > A requirement to do everything in the top level and not use a hierarchy
> > for blkio makes this a pretty unfriendly controller to use. It seriously
> > limits the flexibility of what libvirt and host administrators can do and
> > means we can't effectively split policy between them. It also means
> > that if the blkio controller were ever mounted at the same point as another
> > controller, you'd lose the hierarchy support for that other controller.
> > IMHO use of the cgroups hierarchy is key to making cgroups manageable for
> > applications. We can't have many different applications on a system
> > all having to create many directories at the top level.
> 

I understand that not having hierarchical support is a huge limitation,
and in the future I would like to get there. It's just that at the moment providing
that support is hard, as I am struggling with more basic issues which are
more important.

Secondly, just because some controller allows creation of hierarchy does
not mean that hierarchy is being enforced. For example, memory controller.
IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
otherwise effectively it is flat. So if libvirt is creating groups and
putting machines in child groups, thinking that it is not interfering
with the admin's policy, that is not entirely correct.

So how do we make progress here. I really want to see blkio controller
integrated with libvirt.

About the issue of hierarchy, I can probably travel down the path of allowing
creation of hierarchy but CFQ will treat it as flat. Though I don't like it
because it will force me to introduce variables like "use_hierarchy" once
real hierarchical support comes in but I guess I can live with that.
(Anyway memory controller is already doing it.).

There is another issue though and that is by default every virtual
machine going into a group of its own. As of today, it can have
severe performance penalties (depending on workload) if group is not
driving enough IO. (Especially with group_isolation=1).

I was thinking of a model where an admin moves out the bad virtual
machines in separate group and limit their IO.

Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-16 14:35                       ` Vivek Goyal
@ 2010-07-16 14:53                         ` Daniel P. Berrange
  2010-07-16 15:12                           ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: Daniel P. Berrange @ 2010-07-16 14:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> Secondly, just because some controller allows creation of hierarchy does
> not mean that hierarchy is being enforced. For example, memory controller.
> IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
> otherwise effectively it is flat. So if libvirt is creating groups and
> putting machines in child groups thinking that we are not interfering
> with admin's policy, is not entirely correct.

That is true, but that 'use_hierarchy' at least provides admins
the mechanism required to implement the necessary policy.

> So how do we make progress here. I really want to see blkio controller
> integrated with libvirt.
> 
> About the issue of hierarchy, I can probably travel down the path of allowing
> creation of hierarchy but CFQ will treat it as flat. Though I don't like it
> because it will force me to introduce variables like "use_hierarchy" once
> real hierarchical support comes in but I guess I can live with that.
> (Anyway memory controller is already doing it.).
> 
> There is another issue though and that is by default every virtual
> machine going into a group of its own. As of today, it can have
> severe performance penalties (depending on workload) if group is not
> > driving enough IO. (Especially with group_isolation=1).
> 
> I was thinking of a model where an admin moves out the bad virtual
> machines in separate group and limit their IO.

In the simple / normal case I imagine all guest VMs will be running
unrestricted I/O initially. Thus instead of creating the cgroup at the time
of VM startup, we could create the cgroup only when the admin actually
sets an I/O limit. IIUC, this should maintain the one cgroup per guest
model, while avoiding the performance penalty in normal use. The caveat
of course is that this would require the blkio controller to have a dedicated
mount point, not shared with another controller.  I think we might also
want this kind of model for net I/O, since we probably don't want to be
creating TC classes + net_cls groups for every VM the moment it starts
unless the admin has actually set a net I/O limit.

Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-16 14:53                         ` Daniel P. Berrange
@ 2010-07-16 15:12                           ` Vivek Goyal
  2010-07-27 10:40                             ` Daniel P. Berrange
  0 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-07-16 15:12 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 16, 2010 at 03:53:09PM +0100, Daniel P. Berrange wrote:
> On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> > On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> > Secondly, just because some controller allows creation of hierarchy does
> > not mean that hierarchy is being enforced. For example, memory controller.
> > IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
> > otherwise effectively it is flat. So if libvirt is creating groups and
> > putting machines in child groups thinking that we are not interfering
> > with admin's policy, is not entirely correct.
> 
> That is true, but that 'use_hierarchy' at least provides admins
> the mechanism required to implement the necessary policy.
> 
> > So how do we make progress here. I really want to see blkio controller
> > integrated with libvirt.
> > 
> > About the issue of hierarchy, I can probably travel down the path of allowing
> > creation of hierarchy but CFQ will treat it as flat. Though I don't like it
> > because it will force me to introduce variables like "use_hierarchy" once
> > real hierarchical support comes in but I guess I can live with that.
> > (Anyway memory controller is already doing it.).
> > 
> > There is another issue though and that is by default every virtual
> > machine going into a group of its own. As of today, it can have
> > severe performance penalties (depending on workload) if group is not
> > > driving enough IO. (Especially with group_isolation=1).
> > 
> > I was thinking of a model where an admin moves out the bad virtual
> > machines in separate group and limit their IO.
> 
> In the simple / normal case I imagine all guests VMs will be running
> unrestricted I/O initially. Thus instead of creating the cgroup at time
> of VM startup, we could create the cgroup only when the admin actually
> sets an I/O limit.

That makes sense. Run all the virtual machines by default in the root group
and move a virtual machine out to a separate group of either low weight
(if the virtual machine is a bad one and driving a lot of IO) or of higher weight
(if we want to give more IO bw to this machine).

> IIUC, this should maintain the one cgroup per guest
> model, while avoiding the performance penalty in normal use. The caveat
> of course is that this would require blkio controller to have a dedicated
> mount point, not shared with other controller.

Yes. Because for other controllers we seem to be putting virtual machines
in separate cgroups by default at startup time. So it seems we will
require a separate mount point here for blkio controller.

>  I think we might also
> want this kind of model for net I/O, since we probably don't want to 
> creating TC classes + net_cls groups for every VM the moment it starts
> unless the admin has actually set a net I/O limit.

Looks like. So good, then network controller and blkio controller can
share this new mount point.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-12  0:20         ` KAMEZAWA Hiroyuki
  2010-07-12 13:18           ` Vivek Goyal
@ 2010-07-22 19:28           ` Greg Thelen
  2010-07-22 23:59             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 53+ messages in thread
From: Greg Thelen @ 2010-07-22 19:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Sun, Jul 11, 2010 at 5:20 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sat, 10 Jul 2010 09:24:17 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
>> On Fri, Jul 09, 2010 at 05:55:23PM -0700, Nauman Rafique wrote:
>>
>> [..]
>> > > Well, right.  I agree.
>> > > But I think we can work parallel.  I will try to struggle on both.
>> >
>> > IMHO, we have a classic chicken and egg problem here. We should try to
>> > merge pieces as they become available. If we get to agree on patches
>> > that do async IO tracking for IO controller, we should go ahead with
>> > them instead of trying to wait for per cgroup dirty ratios.
>> >
>> > In terms of getting numbers, we have been using patches that add per
>> > cpuset dirty ratios on top of NUMA_EMU, and we get good
>> > differentiation between buffered writes as well as buffered writes vs.
>> > reads.
>> >
>> > It is really obvious that as long as flusher threads ,etc are not
>> > cgroup aware, differentiation for buffered writes would not be perfect
>> > in all cases, but this is a step in the right direction and we should
>> > go for it.
>>
>> Working parallel on two separate pieces is fine. But pushing second piece
>> in first does not make much sense to me because second piece does not work
>> if first piece is not in. There is no way to test it. What's the point of
>> pushing code into the kernel which only compiles but does not achieve its intended
>> purposes because some other pieces are missing.
>>
>> Per cgroup dirty ratio is a little hard problem and few attempts have
>> already been made at it. IMHO, we need to first work on that piece and
>> get it inside the kernel and then work on IO tracking patches. Lets
>> fix the hard problem first that is necessary to make second set of patches
>> work.
>>
>
> I've just waited for dirty-ratio patches because I know someone is working on.
> But, hmm, I'll consider to start work by myself.

I have some patches to address the dirty ratios.  I will post them.

These dirty-ratio patches do not do anything intelligent with respect to
per-cgroup writeback.  When a cgroup dirty ratio is exceeded, a
per-bdi writeback is triggered.
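
[Editor's note: a tiny user-space model of the check just described, i.e. "when a cgroup's dirty ratio is exceeded, kick a per-bdi writeback".  The field names and the threshold arithmetic are assumptions for illustration, not code from Greg's patches.]

#include <stdbool.h>
#include <stdio.h>

struct memcg_dirty_info {
    unsigned long dirty_pages;  /* dirty pages charged to the cgroup */
    unsigned long limit_pages;  /* cgroup memory limit, in pages */
    unsigned int dirty_ratio;   /* percentage, like vm.dirty_ratio */
};

static bool over_cgroup_dirty_ratio(const struct memcg_dirty_info *info)
{
    unsigned long threshold = info->limit_pages * info->dirty_ratio / 100;

    return info->dirty_pages > threshold;
}

int main(void)
{
    struct memcg_dirty_info info = {
        .dirty_pages = 300, .limit_pages = 1000, .dirty_ratio = 20,
    };

    if (over_cgroup_dirty_ratio(&info))
        printf("would trigger per-bdi writeback for this cgroup\n");
    return 0;
}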

> (Off-topic)
> BTW, why io-cgroup's hierarchy level is limited to 2 ?
> Because of that limitation, libvirt can't work well...
>
> Thanks,
> -Kame
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-22 19:28           ` Greg Thelen
@ 2010-07-22 23:59             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-22 23:59 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Vivek Goyal, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Thu, 22 Jul 2010 12:28:50 -0700
Greg Thelen <gthelen@google.com> wrote:

> > I've just waited for dirty-ratio patches because I know someone is working on.
> > But, hmm, I'll consider to start work by myself.
> 
> I have some patches that I have to address the dirty-ratios.  I will post them.
> 

Please wait until my proposal to implement a light-weight lock-less update_stat().
I'll handle FILE_MAPPED in it and add a generic interface for updating statistics
in mem_cgroup via page_cgroup.
(I'll post it today if I can, IOW, if I'm lucky.)

If not, we will have to discuss the same thing again, which would be hell.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (12 preceding siblings ...)
  2010-07-09 13:45 ` Vivek Goyal
@ 2010-07-26  6:41 ` Balbir Singh
  2010-07-27  6:40   ` Greg Thelen
  2010-08-02 20:58 ` Vivek Goyal
  14 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2010-07-26  6:41 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Andrea Righi, Gui Jianfeng, akpm

* Munihiro Ikeda <m-ikeda@ds.jp.nec.com> [2010-07-08 22:57:13]:

> These RFC patches are trial to add async (cached) write support on blkio
> controller.
> 
> Only test which has been done is to compile, boot, and that write bandwidth
> seems prioritized when pages which were dirtied by two different processes in
> different cgroups are written back to a device simultaneously.  I know this
> is the minimum (or less) test but I posted this as RFC because I would like
> to hear your opinions about the design direction in the early stage.
> 
> Patches are for 2.6.35-rc4.
> 
> This patch series consists of two chunks.
> 
> (1) iotrack (patch 01/11 -- 06/11)
> 
> This is a functionality to track who dirtied a page, in exact which cgroup a
> process which dirtied a page belongs to.  Blkio controller will read the info
> later and prioritize when the page is actually written to a block device.
> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
> proposals for IO controller.
>

Does this reuse the memcg infrastructure? If so, could you please add a
summary of the changes here.
 
> 
> (2) blkio controller modification (07/11 -- 11/11)
> 
> The main part of blkio controller async write support.
> Currently async queues are device-wide and async write IOs are always treated
> as root group.
> These patches make async queues per a cfq_group per a device to control them.
> Async write is handled by flush kernel thread.  Because queue pointers are
> stored in cfq_io_context, io_context of the thread has to have multiple
> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
> io_context per a cfq_group, which means per an io_context per a cgroup per a
> device.
> 
> 
> This might be a piece of puzzle for complete async write support of blkio
> controller.  One of other pieces in my head is page dirtying ratio control.
> I believe Andrea Righi was working on it...how about the situation?
> 

Greg posted the last set of patches, we are yet to see another
iteration.

> And also, I'm thinking that async write support is required by bandwidth
> capping policy of blkio controller.  Bandwidth capping can be done in upper
> layer than elevator.  However I think it should be also done in elevator layer
> in my opinion.  Elevator buffers and sort requests.  If there is another
> buffering functionality in upper layer, it is doubled buffering and it can be
> harmful for elevator's prediction.
>


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller
  2010-07-09  3:14 ` [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller Munehiro Ikeda
@ 2010-07-26  6:49   ` Balbir Singh
  0 siblings, 0 replies; 53+ messages in thread
From: Balbir Singh @ 2010-07-26  6:49 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Andrea Righi, Gui Jianfeng, akpm

* Munihiro Ikeda <m-ikeda@ds.jp.nec.com> [2010-07-08 23:14:57]:

> This patch makes page_cgroup independent from memory controller
> so that kernel functionalities other than memory controller can
> use page_cgroup.
> 
> This patch is based on a patch posted from Ryo Tsuruta on Oct 2,
> 2009 titled "The new page_cgroup framework".
> 
> Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
> Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>

Seems reasonable to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-27  6:40   ` Greg Thelen
@ 2010-07-27  6:39     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-27  6:39 UTC (permalink / raw)
  To: Greg Thelen
  Cc: balbir, Munehiro Ikeda, linux-kernel, jens.axboe, Vivek Goyal,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm

On Mon, 26 Jul 2010 23:40:07 -0700
Greg Thelen <gthelen@google.com> wrote:

> On Sun, Jul 25, 2010 at 11:41 PM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> > * Munihiro Ikeda <m-ikeda@ds.jp.nec.com> [2010-07-08 22:57:13]:
> >
> >> These RFC patches are trial to add async (cached) write support on blkio
> >> controller.
> >>
> >> Only test which has been done is to compile, boot, and that write bandwidth
> >> seems prioritized when pages which were dirtied by two different processes in
> >> different cgroups are written back to a device simultaneously.  I know this
> >> is the minimum (or less) test but I posted this as RFC because I would like
> >> to hear your opinions about the design direction in the early stage.
> >>
> >> Patches are for 2.6.35-rc4.
> >>
> >> This patch series consists of two chunks.
> >>
> >> (1) iotrack (patch 01/11 -- 06/11)
> >>
> >> This is a functionality to track who dirtied a page, in exact which cgroup a
> >> process which dirtied a page belongs to.  Blkio controller will read the info
> >> later and prioritize when the page is actually written to a block device.
> >> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
> >> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
> >> proposals for IO controller.
> >>
> >
> > Does this reuse the memcg infrastructure, if so could you please add a
> > summary of the changes here.
> >
> >>
> >> (2) blkio controller modification (07/11 -- 11/11)
> >>
> >> The main part of blkio controller async write support.
> >> Currently async queues are device-wide and async write IOs are always treated
> >> as root group.
> >> These patches make async queues per a cfq_group per a device to control them.
> >> Async write is handled by flush kernel thread.  Because queue pointers are
> >> stored in cfq_io_context, io_context of the thread has to have multiple
> >> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
> >> io_context per a cfq_group, which means per an io_context per a cgroup per a
> >> device.
> >>
> >>
> >> This might be a piece of puzzle for complete async write support of blkio
> >> controller.  One of other pieces in my head is page dirtying ratio control.
> >> I believe Andrea Righi was working on it...how about the situation?
> >>
> >
> > Greg posted the last set of patches, we are yet to see another
> > iteration.
> 
> I am waiting to post the next iteration of memcg dirty limits and ratios until
> Kame-san posts light-weight lockless update_stat().  I can post the dirty ratio
> patches before the lockless updates are available, but I imagine there will be
> a significant merge.  So I prefer to wait, assuming that these changes will be
> coming in the near future.
> 
I will post an RFC version today.

But it may need some sorting out... I'll CC you.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-26  6:41 ` Balbir Singh
@ 2010-07-27  6:40   ` Greg Thelen
  2010-07-27  6:39     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: Greg Thelen @ 2010-07-27  6:40 UTC (permalink / raw)
  To: balbir
  Cc: Munehiro Ikeda, linux-kernel, jens.axboe, Vivek Goyal,
	Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi, Gui Jianfeng,
	akpm

On Sun, Jul 25, 2010 at 11:41 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> * Munihiro Ikeda <m-ikeda@ds.jp.nec.com> [2010-07-08 22:57:13]:
>
>> These RFC patches are trial to add async (cached) write support on blkio
>> controller.
>>
>> Only test which has been done is to compile, boot, and that write bandwidth
>> seems prioritized when pages which were dirtied by two different processes in
>> different cgroups are written back to a device simultaneously.  I know this
>> is the minimum (or less) test but I posted this as RFC because I would like
>> to hear your opinions about the design direction in the early stage.
>>
>> Patches are for 2.6.35-rc4.
>>
>> This patch series consists of two chunks.
>>
>> (1) iotrack (patch 01/11 -- 06/11)
>>
>> This is a functionality to track who dirtied a page, in exact which cgroup a
>> process which dirtied a page belongs to.  Blkio controller will read the info
>> later and prioritize when the page is actually written to a block device.
>> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
>> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
>> proposals for IO controller.
>>
>
> Does this reuse the memcg infrastructure, if so could you please add a
> summary of the changes here.
>
>>
>> (2) blkio controller modification (07/11 -- 11/11)
>>
>> The main part of blkio controller async write support.
>> Currently async queues are device-wide and async write IOs are always treated
>> as root group.
>> These patches make async queues per a cfq_group per a device to control them.
>> Async write is handled by flush kernel thread.  Because queue pointers are
>> stored in cfq_io_context, io_context of the thread has to have multiple
>> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
>> io_context per a cfq_group, which means per an io_context per a cgroup per a
>> device.
>>
>>
>> This might be a piece of puzzle for complete async write support of blkio
>> controller.  One of other pieces in my head is page dirtying ratio control.
>> I believe Andrea Righi was working on it...how about the situation?
>>
>
> Greg posted the last set of patches, we are yet to see another
> iteration.

I am waiting to post the next iteration of memcg dirty limits and ratios until
Kame-san posts light-weight lockless update_stat().  I can post the dirty ratio
patches before the lockless updates are available, but I imagine there will be
a significant merge.  So I prefer to wait, assuming that these changes will be
coming in the near future.

>> And also, I'm thinking that async write support is required by bandwidth
>> capping policy of blkio controller.  Bandwidth capping can be done in upper
>> layer than elevator.  However I think it should be also done in elevator layer
>> in my opinion.  Elevator buffers and sort requests.  If there is another
>> buffering functionality in upper layer, it is doubled buffering and it can be
>> harmful for elevator's prediction.
>>
>
>
> --
>        Three Cheers,
>        Balbir
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-16 15:12                           ` Vivek Goyal
@ 2010-07-27 10:40                             ` Daniel P. Berrange
  2010-07-27 14:03                               ` Vivek Goyal
  0 siblings, 1 reply; 53+ messages in thread
From: Daniel P. Berrange @ 2010-07-27 10:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Jul 16, 2010 at 11:12:34AM -0400, Vivek Goyal wrote:
> On Fri, Jul 16, 2010 at 03:53:09PM +0100, Daniel P. Berrange wrote:
> > On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> > > On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> > > Secondly, just because some controller allows creation of hierarchy does
> > > not mean that hierarchy is being enforced. For example, memory controller.
> > > IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
> > > otherwise effectively it is flat. So if libvirt is creating groups and
> > > putting machines in child groups thinking that we are not interfering
> > > with admin's policy, is not entirely correct.
> > 
> > That is true, but that 'use_hierarchy' at least provides admins
> > the mechanism required to implement the necessary policy.
> > 
> > > So how do we make progress here. I really want to see blkio controller
> > > integrated with libvirt.
> > > 
> > > About the issue of hierarchy, I can probably travel down the path of allowing
> > > creation of hierarchy but CFQ will treat it as flat. Though I don't like it
> > > because it will force me to introduce variables like "use_hierarchy" once
> > > real hierarchical support comes in but I guess I can live with that.
> > > (Anyway memory controller is already doing it.).
> > > 
> > > There is another issue though and that is by default every virtual
> > > machine going into a group of its own. As of today, it can have
> > > severe performance penalties (depending on workload) if group is not
> > > driving enough IO. (Especially with group_isolation=1).
> > > 
> > > I was thinking of a model where an admin moves out the bad virtual
> > > machines in separate group and limit their IO.
> > 
> > In the simple / normal case I imagine all guests VMs will be running
> > unrestricted I/O initially. Thus instead of creating the cgroup at time
> > of VM startup, we could create the cgroup only when the admin actually
> > sets an I/O limit.
> 
> That makes sense. Run all the virtual machines by default in root group
> and move out a virtual machine to a separate group of either low weight
> (if virtual machine is a bad one and driving lot of IO) or of higher weight
> (if we want to give more IO bw to this machine).
> 
> > IIUC, this should maintain the one cgroup per guest
> > model, while avoiding the performance penalty in normal use. The caveat
> > of course is that this would require blkio controller to have a dedicated
> > mount point, not shared with other controller.
> 
> Yes. Because for other controllers we seem to be putting virtual machines
> in separate cgroups by default at startup time. So it seems we will
> require a separate mount point here for blkio controller.
> 
> >  I think we might also
> > want this kind of model for net I/O, since we probably don't want to 
> > creating TC classes + net_cls groups for every VM the moment it starts
> > unless the admin has actually set a net I/O limit.
> 
> Looks like. So good, then network controller and blkio controller can
> share this new mount point.

After thinking about this some more, there are a couple of problems with
this plan. For QEMU, 'vhostnet' (the in-kernel virtio network backend)
requires that QEMU be in the cgroup at the time of startup, otherwise the
vhost kernel thread won't end up in the right cgroup.
container driver, moving the container in & out of the cgroups at runtime
is pretty difficult because there are an arbitrary number of processes
running in the container. It would require moving all the container
processes between two cgroups in a race free manner. So on second thoughts
I'm more inclined to stick with our current approach of putting all guests
into the appropriate cgroups at guest/container startup, even for blkio
and netcls. 

Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-27 10:40                             ` Daniel P. Berrange
@ 2010-07-27 14:03                               ` Vivek Goyal
  0 siblings, 0 replies; 53+ messages in thread
From: Vivek Goyal @ 2010-07-27 14:03 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: KAMEZAWA Hiroyuki, Nauman Rafique, Munehiro Ikeda, linux-kernel,
	Ryo Tsuruta, taka, Andrea Righi, Gui Jianfeng, akpm, balbir

On Tue, Jul 27, 2010 at 11:40:37AM +0100, Daniel P. Berrange wrote:
> On Fri, Jul 16, 2010 at 11:12:34AM -0400, Vivek Goyal wrote:
> > On Fri, Jul 16, 2010 at 03:53:09PM +0100, Daniel P. Berrange wrote:
> > > On Fri, Jul 16, 2010 at 10:35:36AM -0400, Vivek Goyal wrote:
> > > > On Fri, Jul 16, 2010 at 03:15:49PM +0100, Daniel P. Berrange wrote:
> > > > Secondly, just because some controller allows creation of hierarchy does
> > > > not mean that hierarchy is being enforced. For example, memory controller.
> > > > IIUC, one needs to explicitly set "use_hierarchy" to enforce hierarchy
> > > > otherwise effectively it is flat. So if libvirt is creating groups and
> > > > putting machines in child groups thinking that we are not interfering
> > > > with admin's policy, is not entirely correct.
> > > 
> > > That is true, but that 'use_hierarchy' at least provides admins
> > > the mechanism required to implement the necessary policy.
> > > 
> > > > So how do we make progress here. I really want to see blkio controller
> > > > integrated with libvirt.
> > > > 
> > > > About the issue of hierarchy, I can probably travel down the path of allowing
> > > > creation of hierarchy but CFQ will treat it as flat. Though I don't like it
> > > > because it will force me to introduce variables like "use_hierarchy" once
> > > > real hierarchical support comes in but I guess I can live with that.
> > > > (Anyway memory controller is already doing it.).
> > > > 
> > > > There is another issue though and that is by default every virtual
> > > > machine going into a group of its own. As of today, it can have
> > > > severe performance penalties (depending on workload) if group is not
> > > > driving enough IO. (Especially with group_isolation=1).
> > > > 
> > > > I was thinking of a model where an admin moves out the bad virtual
> > > > machines in separate group and limit their IO.
> > > 
> > > In the simple / normal case I imagine all guests VMs will be running
> > > unrestricted I/O initially. Thus instead of creating the cgroup at time
> > > of VM startup, we could create the cgroup only when the admin actually
> > > sets an I/O limit.
> > 
> > That makes sense. Run all the virtual machines by default in root group
> > and move out a virtual machine to a separate group of either low weight
> > (if virtual machine is a bad one and driving lot of IO) or of higher weight
> > (if we want to give more IO bw to this machine).
> > 
> > > IIUC, this should maintain the one cgroup per guest
> > > model, while avoiding the performance penalty in normal use. The caveat
> > > of course is that this would require blkio controller to have a dedicated
> > > mount point, not shared with other controller.
> > 
> > Yes. Because for other controllers we seem to be putting virtual machines
> > in separate cgroups by default at startup time. So it seems we will
> > require a separate mount point here for blkio controller.
> > 
> > >  I think we might also
> > > want this kind of model for net I/O, since we probably don't want to 
> > > creating TC classes + net_cls groups for every VM the moment it starts
> > > unless the admin has actually set a net I/O limit.
> > 
> > Looks like. So good, then network controller and blkio controller can
> > share the this new mount point. 
> 
> After thinking about this some more there are a couple of problems with
> this plan. For QEMU the 'vhostnet' (the in kernel virtio network backend)
> requires that QEMU be in the cgroup at time of startup, otherwise the
> vhost kernel thread won't end up in the right cgroup.

Not sure why this limitation is there in vhostnet.

> For libvirt's LXC
> container driver, moving the container in & out of the cgroups at runtime
> is pretty difficult because there are an arbitrary number of processes
> running in the container.

So once a container is created, we don't have the capability to move
it between cgroups? One needs to shut down the container and relaunch it
in the desired cgroup.

> It would require moving all the container
> processes between two cgroups in a race free manner. So on second thoughts
> I'm more inclined to stick with our current approach of putting all guests
> into the appropriate cgroups at guest/container startup, even for blkio
> and netcls. 

In the current form of the code, it is a bad idea from the "blkio" perspective. Very
often, a virtual machine might not be driving enough IO and we will see
decreased overall throughput. That's why I was preferring to move
a virtual machine out into its own cgroup only if required.

I was also thinking of implementing a new tunable in CFQ, something like
"min_queue_depth". It would mean: don't idle on groups if we are not
driving at least min_queue_depth. The higher the "min_queue_depth", the lower the
isolation between groups. But this will take effect only if slice_idle=0, and that
would be done only on higher-end storage.
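
[Editor's note: a toy user-space model of the tunable proposed above: isolation (idling) is given up for groups whose queue depth is below min_queue_depth.  The names and the exact condition are hypothetical, not existing CFQ code.]

#include <stdbool.h>
#include <stdio.h>

static unsigned int slice_idle;             /* the idea applies when this is 0 */
static unsigned int min_queue_depth = 4;    /* proposed tunable */

struct group_model {
    unsigned int queue_depth;   /* requests the group is currently driving */
};

static bool should_isolate_group(const struct group_model *grp)
{
    /* The higher min_queue_depth is, the more groups fall below it,
     * so the lower the isolation between groups. */
    if (slice_idle == 0 && grp->queue_depth < min_queue_depth)
        return false;   /* let go of fairness, keep throughput */
    return true;
}

int main(void)
{
    struct group_model slow = { .queue_depth = 1 };
    struct group_model busy = { .queue_depth = 8 };

    printf("isolate slow group: %d, busy group: %d\n",
           should_isolate_group(&slow), should_isolate_group(&busy));
    return 0;
}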

IOW, I am experimenting with above bits, but I certainly would not
recommend putting virtual machines/containers in their own blkio cgroup
by default. 

How about not co-mounting blkio and net_cls? Then for network, you can
continue to put each virtual machine in a cgroup of its own, and that should
take care of the vhostnet issue. For blkio, we will continue to put virtual
machines in the common root group.

For the container driver issue, we need to figure out how to move
containers around between cgroups. Not sure how hard that is, though.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
                   ` (13 preceding siblings ...)
  2010-07-26  6:41 ` Balbir Singh
@ 2010-08-02 20:58 ` Vivek Goyal
  2010-08-03 14:31   ` Munehiro Ikeda
  14 siblings, 1 reply; 53+ messages in thread
From: Vivek Goyal @ 2010-08-02 20:58 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
> These RFC patches are trial to add async (cached) write support on blkio
> controller.
> 
> Only test which has been done is to compile, boot, and that write bandwidth
> seems prioritized when pages which were dirtied by two different processes in
> different cgroups are written back to a device simultaneously.  I know this
> is the minimum (or less) test but I posted this as RFC because I would like
> to hear your opinions about the design direction in the early stage.
> 
> Patches are for 2.6.35-rc4.
> 
> This patch series consists of two chunks.
> 
> (1) iotrack (patch 01/11 -- 06/11)
> 
> This is a functionality to track who dirtied a page, in exact which cgroup a
> process which dirtied a page belongs to.  Blkio controller will read the info
> later and prioritize when the page is actually written to a block device.
> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
> proposals for IO controller.
> 
> 
> (2) blkio controller modification (07/11 -- 11/11)
> 
> The main part of blkio controller async write support.
> Currently async queues are device-wide and async write IOs are always treated
> as root group.
> These patches make async queues per a cfq_group per a device to control them.
> Async write is handled by flush kernel thread.  Because queue pointers are
> stored in cfq_io_context, io_context of the thread has to have multiple
> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
> io_context per a cfq_group, which means per an io_context per a cgroup per a
> device.
> 
> 

Muuh,

You will require one more piece, and that is support for per-cgroup request
descriptors on the request queue. With writes, it is very easy to consume those
128 request descriptors.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-08-02 20:58 ` Vivek Goyal
@ 2010-08-03 14:31   ` Munehiro Ikeda
  2010-08-03 19:24     ` Nauman Rafique
  2010-08-03 20:15     ` Vivek Goyal
  0 siblings, 2 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-08-03 14:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

Vivek Goyal wrote, on 08/02/2010 04:58 PM:
> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>> These RFC patches are trial to add async (cached) write support on blkio
>> controller.
>>
>> Only test which has been done is to compile, boot, and that write bandwidth
>> seems prioritized when pages which were dirtied by two different processes in
>> different cgroups are written back to a device simultaneously.  I know this
>> is the minimum (or less) test but I posted this as RFC because I would like
>> to hear your opinions about the design direction in the early stage.
>>
>> Patches are for 2.6.35-rc4.
>>
>> This patch series consists of two chunks.
>>
>> (1) iotrack (patch 01/11 -- 06/11)
>>
>> This is a functionality to track who dirtied a page, in exact which cgroup a
>> process which dirtied a page belongs to.  Blkio controller will read the info
>> later and prioritize when the page is actually written to a block device.
>> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and includes
>> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one of
>> proposals for IO controller.
>>
>>
>> (2) blkio controller modification (07/11 -- 11/11)
>>
>> The main part of blkio controller async write support.
>> Currently async queues are device-wide and async write IOs are always treated
>> as root group.
>> These patches make async queues per a cfq_group per a device to control them.
>> Async write is handled by flush kernel thread.  Because queue pointers are
>> stored in cfq_io_context, io_context of the thread has to have multiple
>> cfq_io_contexts per a device.  So these patches make cfq_io_context per an
>> io_context per a cfq_group, which means per an io_context per a cgroup per a
>> device.
>>
>>
>
> Muuh,
>
> You will require one more piece and that is support for per cgroup request
> descriptors on request queue. With writes, it is so easy to consume those
> 128 request descriptors.

Hi Vivek,

Yes, thank you for the comment.
I have two concerns about doing that.

(1) technical concern
If there is a fixed device-wide limit and there are many groups, the
number of request descriptors handed to each group can become too
small.  My only idea for this is to make the device-wide limit
flexible, but I'm not sure whether that is the best approach or even
acceptable.

(2) implementation concern
Right now the limit is enforced by the generic block layer, which
doesn't know about grouping.  The idea in my head is to add a new
interface to elevator_ops so that the block layer can ask the IO
scheduler whether a new request can be allocated.
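
Just to illustrate the direction I have in mind (this is only a rough
sketch; the hook name and signature below are made up for
illustration, not an existing interface):

/*
 * Hypothetical new hook in struct elevator_ops (illustrative only):
 *
 *      int (*elevator_may_alloc_req_fn)(struct request_queue *q,
 *                                       struct bio *bio, gfp_t gfp_mask);
 *
 * The generic block layer would call it from get_request() before
 * dipping into the request descriptor pool, for example:
 */
static inline int elv_may_alloc_request(struct request_queue *q,
                                        struct bio *bio, gfp_t gfp_mask)
{
        struct elevator_queue *e = q->elevator;

        if (e->ops->elevator_may_alloc_req_fn)
                return e->ops->elevator_may_alloc_req_fn(q, bio, gfp_mask);

        /* scheduler has no opinion; fall back to the existing queue limits */
        return 1;
}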

Anyway, I think a simple RFC patch first, followed by testing, would
be preferable.


Thanks,
Muuhh


-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-08-03 14:31   ` Munehiro Ikeda
@ 2010-08-03 19:24     ` Nauman Rafique
  2010-08-04 14:32       ` Munehiro Ikeda
  2010-08-03 20:15     ` Vivek Goyal
  1 sibling, 1 reply; 53+ messages in thread
From: Nauman Rafique @ 2010-08-03 19:24 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: Vivek Goyal, linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Tue, Aug 3, 2010 at 7:31 AM, Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> Vivek Goyal wrote, on 08/02/2010 04:58 PM:
>>
>> On Thu, Jul 08, 2010 at 10:57:13PM -0400, Munehiro Ikeda wrote:
>>>
>>> These RFC patches are trial to add async (cached) write support on blkio
>>> controller.
>>>
>>> Only test which has been done is to compile, boot, and that write
>>> bandwidth
>>> seems prioritized when pages which were dirtied by two different
>>> processes in
>>> different cgroups are written back to a device simultaneously.  I know
>>> this
>>> is the minimum (or less) test but I posted this as RFC because I would
>>> like
>>> to hear your opinions about the design direction in the early stage.
>>>
>>> Patches are for 2.6.35-rc4.
>>>
>>> This patch series consists of two chunks.
>>>
>>> (1) iotrack (patch 01/11 -- 06/11)
>>>
>>> This is a functionality to track who dirtied a page, in exact which
>>> cgroup a
>>> process which dirtied a page belongs to.  Blkio controller will read the
>>> info
>>> later and prioritize when the page is actually written to a block device.
>>> This work is originated from Ryo Tsuruta and Hirokazu Takahashi and
>>> includes
>>> Andrea Righi's idea.  It was posted as a part of dm-ioband which was one
>>> of
>>> proposals for IO controller.
>>>
>>>
>>> (2) blkio controller modification (07/11 -- 11/11)
>>>
>>> The main part of blkio controller async write support.
>>> Currently async queues are device-wide and async write IOs are always
>>> treated
>>> as root group.
>>> These patches make async queues per a cfq_group per a device to control
>>> them.
>>> Async write is handled by flush kernel thread.  Because queue pointers
>>> are
>>> stored in cfq_io_context, io_context of the thread has to have multiple
>>> cfq_io_contexts per a device.  So these patches make cfq_io_context per
>>> an
>>> io_context per a cfq_group, which means per an io_context per a cgroup
>>> per a
>>> device.
>>>
>>>
>>
>> Muuh,
>>
>> You will require one more piece and that is support for per cgroup request
>> descriptors on request queue. With writes, it is so easy to consume those
>> 128 request descriptors.
>
> Hi Vivek,
>
> Yes.  Thank you for the comment.
> I have two concerns to do that.
>
> (1) technical concern
> If there is fixed device-wide limitation and there are so many groups,
> the number of request descriptors distributed to each group can be too
> few.  My only idea for this is to make device-wide limitation flexible,
> but I'm not sure if it is the best or even can be allowed.
>
> (2) implementation concern
> Now the limitation is done by generic block layer which doesn't know
> about grouping.  The idea in my head to solve this is to add a new
> interface on elevator_ops to ask IO scheduler if a new request can
> be allocated.

Muuhh,
We have already done the work of forward porting the request
descriptor patch that Vivek had in his earlier patch sets. We have
also taken care of the two concerns you mentioned above. We have been
testing it and getting good numbers. So if you want, I can send the
patch your way so it can be included in this same patch series.

Thanks.

>
> Anyway, simple RFC patch first and testing it would be preferable,
> I think.
>
>
> Thanks,
> Muuhh
>
>
> --
> IKEDA, Munehiro
>  NEC Corporation of America
>    m-ikeda@ds.jp.nec.com
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-08-03 14:31   ` Munehiro Ikeda
  2010-08-03 19:24     ` Nauman Rafique
@ 2010-08-03 20:15     ` Vivek Goyal
  1 sibling, 0 replies; 53+ messages in thread
From: Vivek Goyal @ 2010-08-03 20:15 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu, Andrea Righi,
	Gui Jianfeng, akpm, balbir

On Tue, Aug 03, 2010 at 10:31:33AM -0400, Munehiro Ikeda wrote:

[..]
> >Muuh,
> >
> >You will require one more piece and that is support for per cgroup request
> >descriptors on request queue. With writes, it is so easy to consume those
> >128 request descriptors.
> 
> Hi Vivek,
> 
> Yes.  Thank you for the comment.
> I have two concerns to do that.
> 
> (1) technical concern
> If there is fixed device-wide limitation and there are so many groups,
> the number of request descriptors distributed to each group can be too
> few.  My only idea for this is to make device-wide limitation flexible,
> but I'm not sure if it is the best or even can be allowed.
> 
> (2) implementation concern
> Now the limitation is done by generic block layer which doesn't know
> about grouping.  The idea in my head to solve this is to add a new
> interface on elevator_ops to ask IO scheduler if a new request can
> be allocated.
> 

Actually, that is a good point. We already call into CFQ
(cfq_may_queue()) to make some kind of determination about how urgent
a request allocation is.

Maybe we can just keep track of how many outstanding requests there are
per group in CFQ, and inside CFQ always allow request allocation for the
active group. We could probably refuse the allocation if a group already
has many requests backlogged (say more than 16).

We might overshoot the device-wide limit on request descriptors, but we
already do that anyway (we allow up to 50% more request descriptors, etc.).

So not introducing a per-group limit through sysfs, just doing some rough
internal accounting in CFQ, and being a little flexible about over-allocating
request descriptors might reduce complexity.
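
Roughly something like this (just a sketch of the idea; the
cfqg->nr_queued_rq counter is illustrative, not an existing field):

#define CFQ_GROUP_RQ_BACKLOG_MAX        16

static int cfq_may_queue_group(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
        /* always let the currently active group make progress */
        if (cfqd->active_queue && cfqd->active_queue->cfqg == cfqg)
                return ELV_MQUEUE_MUST;

        /* but stop a group that has already built up a big backlog */
        if (cfqg->nr_queued_rq > CFQ_GROUP_RQ_BACKLOG_MAX)
                return ELV_MQUEUE_NO;

        return ELV_MQUEUE_MAY;
}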

But it probably will not solve the problem of a higher layer asking whether
the queue is congested. The request queue might be congested overall while a
high-priority group should not be affected by that and should still be able
to submit requests. I think the congestion check is primarily used only in
the WRITE path, so the READ path should still be fine.

Once WRITE support is in, we probably need to introduce an additional
mechanism that lets us query per-bdi per-group congestion instead of
per-bdi congestion; one group might be congested but not another. I had
done that in my previous postings.
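
Something along these lines, purely for illustration (neither the
per-group congestion state nor blkio_cgroup_congested() exists today;
only bdi_write_congested() does):

static inline int bdi_group_write_congested(struct backing_dev_info *bdi,
                                            struct blkio_cgroup *blkcg)
{
        /* fall back to the global bdi state if the group is unknown */
        if (!blkcg)
                return bdi_write_congested(bdi);

        /*
         * Per-group congestion state would have to be maintained by the
         * IO scheduler / blkio controller and exported through a helper
         * like this (hypothetical).
         */
        return blkio_cgroup_congested(bdi, blkcg, BLK_RW_ASYNC);
}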

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 00/11] blkiocg async support
  2010-08-03 19:24     ` Nauman Rafique
@ 2010-08-04 14:32       ` Munehiro Ikeda
  0 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-08-04 14:32 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Vivek Goyal, linux-kernel, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

Nauman Rafique wrote, on 08/03/2010 03:24 PM:
> On Tue, Aug 3, 2010 at 7:31 AM, Munehiro Ikeda<m-ikeda@ds.jp.nec.com>  wrote:
>> Vivek Goyal wrote, on 08/02/2010 04:58 PM:
>>> You will require one more piece and that is support for per cgroup request
>>> descriptors on request queue. With writes, it is so easy to consume those
>>> 128 request descriptors.
>>
>> Hi Vivek,
>>
>> Yes.  Thank you for the comment.
>> I have two concerns to do that.
>>
>> (1) technical concern
>> If there is fixed device-wide limitation and there are so many groups,
>> the number of request descriptors distributed to each group can be too
>> few.  My only idea for this is to make device-wide limitation flexible,
>> but I'm not sure if it is the best or even can be allowed.
>>
>> (2) implementation concern
>> Now the limitation is done by generic block layer which doesn't know
>> about grouping.  The idea in my head to solve this is to add a new
>> interface on elevator_ops to ask IO scheduler if a new request can
>> be allocated.
>
> Muuhh,
> We have already done the work of forward porting the request
> descriptor patch that Vivek had in his earlier patch sets. We also
> taken care of the two concerns you have mentioned above. We have been
> testing it, and getting good numbers. So if you want, I can send the
> patch your way so it can be included in this same patch series.
>
> Thanks.

Hi Nauman,

That is the patch I was thinking we should build on.  You have
already done the forward porting, great!
Please post it to LKML, the containers list, etc. independently if you
don't mind.  I appreciate your suggestion to include it in my patch
series, but I'm worried that the patch set would grow beyond what my
poor antique brain processor can handle.
The issue of request descriptor limitation may become significant once
async write is supported, but I don't think it is limited to that
case; it should also benefit the current blkio controller.
And we can combine them after the independent posts if needed.


Thanks a lot,
Muuhh

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group
  2010-07-09  3:22 ` [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group Munehiro Ikeda
@ 2010-08-13  1:24   ` Nauman Rafique
  2010-08-13 21:00     ` Munehiro Ikeda
  0 siblings, 1 reply; 53+ messages in thread
From: Nauman Rafique @ 2010-08-13  1:24 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, jens.axboe, Vivek Goyal, Ryo Tsuruta, taka,
	kamezawa.hiroyu, Andrea Righi, Gui Jianfeng, akpm, balbir

Muuhh,
I do not understand the motivation behind making cfq_io_context per
cgroup. We can extract cgroup id from bio, use that to lookup cgroup
and its async queues. What am I missing?

On Thu, Jul 8, 2010 at 8:22 PM, Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> This is the main body for async capability of blkio controller.
> The basic ideas are
>  - To move async queues from cfq_data to cfq_group, and
>  - To associate cfq_io_context with cfq_group
>
> Each cfq_io_context, which was created per an io_context
> per a device, is now created per an io_context per a cfq_group.
> Each cfq_group is created per a cgroup per a device, so
> cfq_io_context is now resulted in to be created per
> an io_context per a cgroup per a device.
>
> To protect link between cfq_io_context and cfq_group,
> locking code is modified in several parts.
>
> ToDo:
> - Lock protection still might be imperfect and more investigation
>  is needed.
>
> - cic_index of root cfq_group (cfqd->root_group.cic_index) is not
>  removed and lasts beyond elevator switching.
>  This issues should be solved.
>
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
>  block/cfq-iosched.c       |  277 ++++++++++++++++++++++++++++-----------------
>  include/linux/iocontext.h |    2 +-
>  2 files changed, 175 insertions(+), 104 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 68f76e9..4186c30 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -191,10 +191,23 @@ struct cfq_group {
>        struct cfq_rb_root service_trees[2][3];
>        struct cfq_rb_root service_tree_idle;
>
> +       /*
> +        * async queue for each priority case
> +        */
> +       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
> +       struct cfq_queue *async_idle_cfqq;
> +
>        unsigned long saved_workload_slice;
>        enum wl_type_t saved_workload;
>        enum wl_prio_t saved_serving_prio;
>        struct blkio_group blkg;
> +       struct cfq_data *cfqd;
> +
> +       /* lock for cic_list */
> +       spinlock_t lock;
> +       unsigned int cic_index;
> +       struct list_head cic_list;
> +
>  #ifdef CONFIG_CFQ_GROUP_IOSCHED
>        struct hlist_node cfqd_node;
>        atomic_t ref;
> @@ -254,12 +267,6 @@ struct cfq_data {
>        struct cfq_queue *active_queue;
>        struct cfq_io_context *active_cic;
>
> -       /*
> -        * async queue for each priority case
> -        */
> -       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
> -       struct cfq_queue *async_idle_cfqq;
> -
>        sector_t last_position;
>
>        /*
> @@ -275,8 +282,6 @@ struct cfq_data {
>        unsigned int cfq_latency;
>        unsigned int cfq_group_isolation;
>
> -       unsigned int cic_index;
> -       struct list_head cic_list;
>
>        /*
>         * Fallback dummy cfqq for extreme OOM conditions
> @@ -418,10 +423,16 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
>  }
>
>  static void cfq_dispatch_insert(struct request_queue *, struct request *);
> +static void __cfq_exit_single_io_context(struct cfq_data *,
> +                                        struct cfq_group *,
> +                                        struct cfq_io_context *);
>  static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
> -                                      struct io_context *, gfp_t);
> +                                      struct cfq_io_context *, gfp_t);
>  static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
> -                                               struct io_context *);
> +                                            struct cfq_group *,
> +                                            struct io_context *);
> +static void cfq_put_async_queues(struct cfq_group *);
> +static int cfq_alloc_cic_index(void);
>
>  static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
>                                            bool is_sync)
> @@ -438,17 +449,28 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
>  #define CIC_DEAD_KEY   1ul
>  #define CIC_DEAD_INDEX_SHIFT   1
>
> -static inline void *cfqd_dead_key(struct cfq_data *cfqd)
> +static inline void *cfqg_dead_key(struct cfq_group *cfqg)
>  {
> -       return (void *)(cfqd->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
> +       return (void *)(cfqg->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
> +}
> +
> +static inline struct cfq_group *cic_to_cfqg(struct cfq_io_context *cic)
> +{
> +       struct cfq_group *cfqg = cic->key;
> +
> +       if (unlikely((unsigned long) cfqg & CIC_DEAD_KEY))
> +               cfqg = NULL;
> +
> +       return cfqg;
>  }
>
>  static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
>  {
> -       struct cfq_data *cfqd = cic->key;
> +       struct cfq_data *cfqd = NULL;
> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>
> -       if (unlikely((unsigned long) cfqd & CIC_DEAD_KEY))
> -               return NULL;
> +       if (likely(cfqg))
> +               cfqd =  cfqg->cfqd;
>
>        return cfqd;
>  }
> @@ -959,12 +981,12 @@ cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
>  }
>
>  static struct cfq_group *
> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
> +                   int create)
>  {
> -       struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>        struct cfq_group *cfqg = NULL;
>        void *key = cfqd;
> -       int i, j;
> +       int i, j, idx;
>        struct cfq_rb_root *st;
>        struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>        unsigned int major, minor;
> @@ -978,14 +1000,21 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>        if (cfqg || !create)
>                goto done;
>
> +       idx = cfq_alloc_cic_index();
> +       if (idx < 0)
> +               goto done;
> +
>        cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>        if (!cfqg)
> -               goto done;
> +               goto err;
>
>        for_each_cfqg_st(cfqg, i, j, st)
>                *st = CFQ_RB_ROOT;
>        RB_CLEAR_NODE(&cfqg->rb_node);
>
> +       spin_lock_init(&cfqg->lock);
> +       INIT_LIST_HEAD(&cfqg->cic_list);
> +
>        /*
>         * Take the initial reference that will be released on destroy
>         * This can be thought of a joint reference by cgroup and
> @@ -1002,7 +1031,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>
>        /* Add group on cfqd list */
>        hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
> +       cfqg->cfqd = cfqd;
> +       cfqg->cic_index = idx;
> +       goto done;
>
> +err:
> +       spin_lock(&cic_index_lock);
> +       ida_remove(&cic_index_ida, idx);
> +       spin_unlock(&cic_index_lock);
>  done:
>        return cfqg;
>  }
> @@ -1095,10 +1131,6 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
>
>  static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
>  {
> -       /* Currently, all async queues are mapped to root group */
> -       if (!cfq_cfqq_sync(cfqq))
> -               cfqg = &cfqq->cfqd->root_group;
> -
>        cfqq->cfqg = cfqg;
>        /* cfqq reference on cfqg */
>        atomic_inc(&cfqq->cfqg->ref);
> @@ -1114,6 +1146,16 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>                return;
>        for_each_cfqg_st(cfqg, i, j, st)
>                BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
> +
> +       /**
> +        * ToDo:
> +        * root_group never reaches here and cic_index is never
> +        * removed.  Unused cic_index lasts by elevator switching.
> +        */
> +       spin_lock(&cic_index_lock);
> +       ida_remove(&cic_index_ida, cfqg->cic_index);
> +       spin_unlock(&cic_index_lock);
> +
>        kfree(cfqg);
>  }
>
> @@ -1122,6 +1164,15 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
>        /* Something wrong if we are trying to remove same group twice */
>        BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
>
> +       while (!list_empty(&cfqg->cic_list)) {
> +               struct cfq_io_context *cic = list_entry(cfqg->cic_list.next,
> +                                                       struct cfq_io_context,
> +                                                       group_list);
> +
> +               __cfq_exit_single_io_context(cfqd, cfqg, cic);
> +       }
> +
> +       cfq_put_async_queues(cfqg);
>        hlist_del_init(&cfqg->cfqd_node);
>
>        /*
> @@ -1497,10 +1548,12 @@ static struct request *
>  cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
>  {
>        struct task_struct *tsk = current;
> +       struct cfq_group *cfqg;
>        struct cfq_io_context *cic;
>        struct cfq_queue *cfqq;
>
> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>        if (!cic)
>                return NULL;
>
> @@ -1611,6 +1664,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>                           struct bio *bio)
>  {
>        struct cfq_data *cfqd = q->elevator->elevator_data;
> +       struct cfq_group *cfqg;
>        struct cfq_io_context *cic;
>        struct cfq_queue *cfqq;
>
> @@ -1624,7 +1678,8 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>         * Lookup the cfqq that this bio will be queued with. Allow
>         * merge only if rq is queued there.
>         */
> -       cic = cfq_cic_lookup(cfqd, current->io_context);
> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
> +       cic = cfq_cic_lookup(cfqd, cfqg, current->io_context);
>        if (!cic)
>                return false;
>
> @@ -2667,17 +2722,18 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  }
>
>  static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
> +                                        struct cfq_group *cfqg,
>                                         struct cfq_io_context *cic)
>  {
>        struct io_context *ioc = cic->ioc;
>
> -       list_del_init(&cic->queue_list);
> +       list_del_init(&cic->group_list);
>
>        /*
>         * Make sure dead mark is seen for dead queues
>         */
>        smp_wmb();
> -       cic->key = cfqd_dead_key(cfqd);
> +       cic->key = cfqg_dead_key(cfqg);
>
>        if (ioc->ioc_data == cic)
>                rcu_assign_pointer(ioc->ioc_data, NULL);
> @@ -2696,23 +2752,23 @@ static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>  static void cfq_exit_single_io_context(struct io_context *ioc,
>                                       struct cfq_io_context *cic)
>  {
> -       struct cfq_data *cfqd = cic_to_cfqd(cic);
> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>
> -       if (cfqd) {
> -               struct request_queue *q = cfqd->queue;
> +       if (cfqg) {
> +               struct cfq_data *cfqd = cfqg->cfqd;
>                unsigned long flags;
>
> -               spin_lock_irqsave(q->queue_lock, flags);
> +               spin_lock_irqsave(&cfqg->lock, flags);
>
>                /*
>                 * Ensure we get a fresh copy of the ->key to prevent
>                 * race between exiting task and queue
>                 */
>                smp_read_barrier_depends();
> -               if (cic->key == cfqd)
> -                       __cfq_exit_single_io_context(cfqd, cic);
> +               if (cic->key == cfqg)
> +                       __cfq_exit_single_io_context(cfqd, cfqg, cic);
>
> -               spin_unlock_irqrestore(q->queue_lock, flags);
> +               spin_unlock_irqrestore(&cfqg->lock, flags);
>        }
>  }
>
> @@ -2734,7 +2790,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>                                                        cfqd->queue->node);
>        if (cic) {
>                cic->last_end_request = jiffies;
> -               INIT_LIST_HEAD(&cic->queue_list);
> +               INIT_LIST_HEAD(&cic->group_list);
>                INIT_HLIST_NODE(&cic->cic_list);
>                cic->dtor = cfq_free_io_context;
>                cic->exit = cfq_exit_io_context;
> @@ -2801,8 +2857,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
>        cfqq = cic->cfqq[BLK_RW_ASYNC];
>        if (cfqq) {
>                struct cfq_queue *new_cfqq;
> -               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
> -                                               GFP_ATOMIC);
> +               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic, GFP_ATOMIC);
>                if (new_cfqq) {
>                        cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
>                        cfq_put_queue(cfqq);
> @@ -2879,16 +2934,14 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
>
>  static struct cfq_queue *
>  cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
> -                    struct io_context *ioc, gfp_t gfp_mask)
> +                    struct cfq_io_context *cic, gfp_t gfp_mask)
>  {
>        struct cfq_queue *cfqq, *new_cfqq = NULL;
> -       struct cfq_io_context *cic;
> +       struct io_context *ioc = cic->ioc;
>        struct cfq_group *cfqg;
>
>  retry:
> -       cfqg = cfq_get_cfqg(cfqd, NULL, 1);
> -       cic = cfq_cic_lookup(cfqd, ioc);
> -       /* cic always exists here */
> +       cfqg = cic_to_cfqg(cic);
>        cfqq = cic_to_cfqq(cic, is_sync);
>
>        /*
> @@ -2930,36 +2983,38 @@ retry:
>  }
>
>  static struct cfq_queue **
> -cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
> +cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
>  {
>        switch (ioprio_class) {
>        case IOPRIO_CLASS_RT:
> -               return &cfqd->async_cfqq[0][ioprio];
> +               return &cfqg->async_cfqq[0][ioprio];
>        case IOPRIO_CLASS_BE:
> -               return &cfqd->async_cfqq[1][ioprio];
> +               return &cfqg->async_cfqq[1][ioprio];
>        case IOPRIO_CLASS_IDLE:
> -               return &cfqd->async_idle_cfqq;
> +               return &cfqg->async_idle_cfqq;
>        default:
>                BUG();
>        }
>  }
>
>  static struct cfq_queue *
> -cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
> +cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_context *cic,
>              gfp_t gfp_mask)
>  {
> +       struct io_context *ioc = cic->ioc;
>        const int ioprio = task_ioprio(ioc);
>        const int ioprio_class = task_ioprio_class(ioc);
> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>        struct cfq_queue **async_cfqq = NULL;
>        struct cfq_queue *cfqq = NULL;
>
>        if (!is_sync) {
> -               async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
> +               async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
>                cfqq = *async_cfqq;
>        }
>
>        if (!cfqq)
> -               cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
> +               cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, gfp_mask);
>
>        /*
>         * pin the queue now that it's allocated, scheduler exit will prune it
> @@ -2977,19 +3032,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
>  * We drop cfq io contexts lazily, so we may find a dead one.
>  */
>  static void
> -cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
> -                 struct cfq_io_context *cic)
> +cfq_drop_dead_cic(struct cfq_data *cfqd, struct cfq_group *cfqg,
> +                 struct io_context *ioc, struct cfq_io_context *cic)
>  {
>        unsigned long flags;
>
> -       WARN_ON(!list_empty(&cic->queue_list));
> -       BUG_ON(cic->key != cfqd_dead_key(cfqd));
> +       WARN_ON(!list_empty(&cic->group_list));
> +       BUG_ON(cic->key != cfqg_dead_key(cfqg));
>
>        spin_lock_irqsave(&ioc->lock, flags);
>
>        BUG_ON(ioc->ioc_data == cic);
>
> -       radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
> +       radix_tree_delete(&ioc->radix_root, cfqg->cic_index);
>        hlist_del_rcu(&cic->cic_list);
>        spin_unlock_irqrestore(&ioc->lock, flags);
>
> @@ -2997,11 +3052,14 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>  }
>
>  static struct cfq_io_context *
> -cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
> +cfq_cic_lookup(struct cfq_data *cfqd, struct cfq_group *cfqg,
> +              struct io_context *ioc)
>  {
>        struct cfq_io_context *cic;
>        unsigned long flags;
>
> +       if (!cfqg)
> +               return NULL;
>        if (unlikely(!ioc))
>                return NULL;
>
> @@ -3011,18 +3069,18 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>         * we maintain a last-hit cache, to avoid browsing over the tree
>         */
>        cic = rcu_dereference(ioc->ioc_data);
> -       if (cic && cic->key == cfqd) {
> +       if (cic && cic->key == cfqg) {
>                rcu_read_unlock();
>                return cic;
>        }
>
>        do {
> -               cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
> +               cic = radix_tree_lookup(&ioc->radix_root, cfqg->cic_index);
>                rcu_read_unlock();
>                if (!cic)
>                        break;
> -               if (unlikely(cic->key != cfqd)) {
> -                       cfq_drop_dead_cic(cfqd, ioc, cic);
> +               if (unlikely(cic->key != cfqg)) {
> +                       cfq_drop_dead_cic(cfqd, cfqg, ioc, cic);
>                        rcu_read_lock();
>                        continue;
>                }
> @@ -3037,24 +3095,25 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>  }
>
>  /*
> - * Add cic into ioc, using cfqd as the search key. This enables us to lookup
> + * Add cic into ioc, using cfqg as the search key. This enables us to lookup
>  * the process specific cfq io context when entered from the block layer.
> - * Also adds the cic to a per-cfqd list, used when this queue is removed.
> + * Also adds the cic to a per-cfqg list, used when the group is removed.
> + * request_queue lock must be held.
>  */
> -static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
> -                       struct cfq_io_context *cic, gfp_t gfp_mask)
> +static int cfq_cic_link(struct cfq_data *cfqd, struct cfq_group *cfqg,
> +                       struct io_context *ioc, struct cfq_io_context *cic)
>  {
>        unsigned long flags;
>        int ret;
>
> -       ret = radix_tree_preload(gfp_mask);
> +       ret = radix_tree_preload(GFP_ATOMIC);
>        if (!ret) {
>                cic->ioc = ioc;
> -               cic->key = cfqd;
> +               cic->key = cfqg;
>
>                spin_lock_irqsave(&ioc->lock, flags);
>                ret = radix_tree_insert(&ioc->radix_root,
> -                                               cfqd->cic_index, cic);
> +                                               cfqg->cic_index, cic);
>                if (!ret)
>                        hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
>                spin_unlock_irqrestore(&ioc->lock, flags);
> @@ -3062,9 +3121,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>                radix_tree_preload_end();
>
>                if (!ret) {
> -                       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
> -                       list_add(&cic->queue_list, &cfqd->cic_list);
> -                       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
> +                       spin_lock_irqsave(&cfqg->lock, flags);
> +                       list_add(&cic->group_list, &cfqg->cic_list);
> +                       spin_unlock_irqrestore(&cfqg->lock, flags);
>                }
>        }
>
> @@ -3080,10 +3139,12 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>  * than one device managed by cfq.
>  */
>  static struct cfq_io_context *
> -cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
> +cfq_get_io_context(struct cfq_data *cfqd, struct bio *bio, gfp_t gfp_mask)
>  {
>        struct io_context *ioc = NULL;
>        struct cfq_io_context *cic;
> +       struct cfq_group *cfqg, *cfqg2;
> +       unsigned long flags;
>
>        might_sleep_if(gfp_mask & __GFP_WAIT);
>
> @@ -3091,18 +3152,38 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>        if (!ioc)
>                return NULL;
>
> -       cic = cfq_cic_lookup(cfqd, ioc);
> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
> +retry_cfqg:
> +       cfqg = cfq_get_cfqg(cfqd, bio, 1);
> +retry_cic:
> +       cic = cfq_cic_lookup(cfqd, cfqg, ioc);
>        if (cic)
>                goto out;
> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>
>        cic = cfq_alloc_io_context(cfqd, gfp_mask);
>        if (cic == NULL)
>                goto err;
>
> -       if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
> +
> +       /* check the consistency breakage during unlock period */
> +       cfqg2 = cfq_get_cfqg(cfqd, bio, 0);
> +       if (cfqg != cfqg2) {
> +               cfq_cic_free(cic);
> +               if (!cfqg2)
> +                       goto retry_cfqg;
> +               else {
> +                       cfqg = cfqg2;
> +                       goto retry_cic;
> +               }
> +       }
> +
> +       if (cfq_cic_link(cfqd, cfqg, ioc, cic))
>                goto err_free;
>
>  out:
> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>        smp_read_barrier_depends();
>        if (unlikely(ioc->ioprio_changed))
>                cfq_ioc_set_ioprio(ioc);
> @@ -3113,6 +3194,7 @@ out:
>  #endif
>        return cic;
>  err_free:
> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>        cfq_cic_free(cic);
>  err:
>        put_io_context(ioc);
> @@ -3537,6 +3619,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
>  static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>  {
>        struct cfq_data *cfqd = q->elevator->elevator_data;
> +       struct cfq_group *cfqg;
>        struct task_struct *tsk = current;
>        struct cfq_io_context *cic;
>        struct cfq_queue *cfqq;
> @@ -3547,7 +3630,8 @@ static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>         * so just lookup a possibly existing queue, or return 'may queue'
>         * if that fails
>         */
> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>        if (!cic)
>                return ELV_MQUEUE_MAY;
>
> @@ -3636,7 +3720,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>
>        might_sleep_if(gfp_mask & __GFP_WAIT);
>
> -       cic = cfq_get_io_context(cfqd, gfp_mask);
> +       cic = cfq_get_io_context(cfqd, bio, gfp_mask);
>
>        spin_lock_irqsave(q->queue_lock, flags);
>
> @@ -3646,7 +3730,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>  new_queue:
>        cfqq = cic_to_cfqq(cic, is_sync);
>        if (!cfqq || cfqq == &cfqd->oom_cfqq) {
> -               cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
> +               cfqq = cfq_get_queue(cfqd, is_sync, cic, gfp_mask);
>                cic_set_cfqq(cic, cfqq, is_sync);
>        } else {
>                /*
> @@ -3762,19 +3846,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
>        cancel_work_sync(&cfqd->unplug_work);
>  }
>
> -static void cfq_put_async_queues(struct cfq_data *cfqd)
> +static void cfq_put_async_queues(struct cfq_group *cfqg)
>  {
>        int i;
>
>        for (i = 0; i < IOPRIO_BE_NR; i++) {
> -               if (cfqd->async_cfqq[0][i])
> -                       cfq_put_queue(cfqd->async_cfqq[0][i]);
> -               if (cfqd->async_cfqq[1][i])
> -                       cfq_put_queue(cfqd->async_cfqq[1][i]);
> +               if (cfqg->async_cfqq[0][i])
> +                       cfq_put_queue(cfqg->async_cfqq[0][i]);
> +               if (cfqg->async_cfqq[1][i])
> +                       cfq_put_queue(cfqg->async_cfqq[1][i]);
>        }
>
> -       if (cfqd->async_idle_cfqq)
> -               cfq_put_queue(cfqd->async_idle_cfqq);
> +       if (cfqg->async_idle_cfqq)
> +               cfq_put_queue(cfqg->async_idle_cfqq);
>  }
>
>  static void cfq_cfqd_free(struct rcu_head *head)
> @@ -3794,15 +3878,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>        if (cfqd->active_queue)
>                __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
>
> -       while (!list_empty(&cfqd->cic_list)) {
> -               struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
> -                                                       struct cfq_io_context,
> -                                                       queue_list);
> -
> -               __cfq_exit_single_io_context(cfqd, cic);
> -       }
> -
> -       cfq_put_async_queues(cfqd);
>        cfq_release_cfq_groups(cfqd);
>        cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
>
> @@ -3810,10 +3885,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>
>        cfq_shutdown_timer_wq(cfqd);
>
> -       spin_lock(&cic_index_lock);
> -       ida_remove(&cic_index_ida, cfqd->cic_index);
> -       spin_unlock(&cic_index_lock);
> -
>        /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
>        call_rcu(&cfqd->rcu, cfq_cfqd_free);
>  }
> @@ -3823,7 +3894,7 @@ static int cfq_alloc_cic_index(void)
>        int index, error;
>
>        do {
> -               if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
> +               if (!ida_pre_get(&cic_index_ida, GFP_ATOMIC))
>                        return -ENOMEM;
>
>                spin_lock(&cic_index_lock);
> @@ -3839,20 +3910,18 @@ static int cfq_alloc_cic_index(void)
>  static void *cfq_init_queue(struct request_queue *q)
>  {
>        struct cfq_data *cfqd;
> -       int i, j;
> +       int i, j, idx;
>        struct cfq_group *cfqg;
>        struct cfq_rb_root *st;
>
> -       i = cfq_alloc_cic_index();
> -       if (i < 0)
> +       idx = cfq_alloc_cic_index();
> +       if (idx < 0)
>                return NULL;
>
>        cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
>        if (!cfqd)
>                return NULL;
>
> -       cfqd->cic_index = i;
> -
>        /* Init root service tree */
>        cfqd->grp_service_tree = CFQ_RB_ROOT;
>
> @@ -3861,6 +3930,9 @@ static void *cfq_init_queue(struct request_queue *q)
>        for_each_cfqg_st(cfqg, i, j, st)
>                *st = CFQ_RB_ROOT;
>        RB_CLEAR_NODE(&cfqg->rb_node);
> +       cfqg->cfqd = cfqd;
> +       cfqg->cic_index = idx;
> +       INIT_LIST_HEAD(&cfqg->cic_list);
>
>        /* Give preference to root group over other groups */
>        cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
> @@ -3874,6 +3946,7 @@ static void *cfq_init_queue(struct request_queue *q)
>        rcu_read_lock();
>        cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
>                                        (void *)cfqd, 0);
> +       hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>        rcu_read_unlock();
>  #endif
>        /*
> @@ -3893,8 +3966,6 @@ static void *cfq_init_queue(struct request_queue *q)
>        atomic_inc(&cfqd->oom_cfqq.ref);
>        cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
>
> -       INIT_LIST_HEAD(&cfqd->cic_list);
> -
>        cfqd->queue = q;
>
>        init_timer(&cfqd->idle_slice_timer);
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 64d5291..6c05b54 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -18,7 +18,7 @@ struct cfq_io_context {
>        unsigned long ttime_samples;
>        unsigned long ttime_mean;
>
> -       struct list_head queue_list;
> +       struct list_head group_list;
>        struct hlist_node cic_list;
>
>        void (*dtor)(struct io_context *); /* destructor */
> --
> 1.6.2.5
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group
  2010-08-13  1:24   ` Nauman Rafique
@ 2010-08-13 21:00     ` Munehiro Ikeda
  2010-08-13 23:01       ` Nauman Rafique
  0 siblings, 1 reply; 53+ messages in thread
From: Munehiro Ikeda @ 2010-08-13 21:00 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

Nauman Rafique wrote, on 08/12/2010 09:24 PM:
> Muuhh,
> I do not understand the motivation behind making cfq_io_context per
> cgroup. We can extract cgroup id from bio, use that to lookup cgroup
> and its async queues. What am I missing?

The cgroup ID only tells us which cgroup the thread belongs to, not the
thread itself.  This means that the IO priority and IO priority class
cannot be determined from the cgroup ID.
The async queue pointers are held in cfqg->async_cfqq[class][prio], so
once IO priority is taken into account it is impossible to pick the
appropriate queue from the cgroup ID alone.
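
To put it concretely, the lookup needs both pieces of information.  A
small helper, written here only for illustration (patch 10/11 does the
equivalent inside cfq_get_queue(), taking the cfqg from the cic):

static struct cfq_queue **
cfq_async_cfqq_for(struct cfq_data *cfqd, struct bio *bio,
                   struct cfq_io_context *cic)
{
        /* the cgroup (via the bio) only gets us to the cfq_group... */
        struct cfq_group *cfqg = cfq_get_cfqg(cfqd, bio, 0);
        /* ...the thread's io_context is still needed for prio and class */
        const int ioprio = task_ioprio(cic->ioc);
        const int ioprio_class = task_ioprio_class(cic->ioc);

        if (!cfqg)
                return NULL;

        return cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
}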

Frankly speaking, I'm not 100% confident that IO priority should be
applied to async write IOs, but anyway, I made up my mind to keep the
current scheme.

Does that make sense?  If you have any other ideas, I would be glad to
hear them.


Thanks,
Muuhh


>
> On Thu, Jul 8, 2010 at 8:22 PM, Munehiro Ikeda<m-ikeda@ds.jp.nec.com>  wrote:
>> This is the main body for async capability of blkio controller.
>> The basic ideas are
>>   - To move async queues from cfq_data to cfq_group, and
>>   - To associate cfq_io_context with cfq_group
>>
>> Each cfq_io_context, which was created per an io_context
>> per a device, is now created per an io_context per a cfq_group.
>> Each cfq_group is created per a cgroup per a device, so
>> cfq_io_context is now resulted in to be created per
>> an io_context per a cgroup per a device.
>>
>> To protect link between cfq_io_context and cfq_group,
>> locking code is modified in several parts.
>>
>> ToDo:
>> - Lock protection still might be imperfect and more investigation
>>   is needed.
>>
>> - cic_index of root cfq_group (cfqd->root_group.cic_index) is not
>>   removed and lasts beyond elevator switching.
>>   This issues should be solved.
>>
>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>> ---
>>   block/cfq-iosched.c       |  277 ++++++++++++++++++++++++++++-----------------
>>   include/linux/iocontext.h |    2 +-
>>   2 files changed, 175 insertions(+), 104 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 68f76e9..4186c30 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -191,10 +191,23 @@ struct cfq_group {
>>         struct cfq_rb_root service_trees[2][3];
>>         struct cfq_rb_root service_tree_idle;
>>
>> +       /*
>> +        * async queue for each priority case
>> +        */
>> +       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>> +       struct cfq_queue *async_idle_cfqq;
>> +
>>         unsigned long saved_workload_slice;
>>         enum wl_type_t saved_workload;
>>         enum wl_prio_t saved_serving_prio;
>>         struct blkio_group blkg;
>> +       struct cfq_data *cfqd;
>> +
>> +       /* lock for cic_list */
>> +       spinlock_t lock;
>> +       unsigned int cic_index;
>> +       struct list_head cic_list;
>> +
>>   #ifdef CONFIG_CFQ_GROUP_IOSCHED
>>         struct hlist_node cfqd_node;
>>         atomic_t ref;
>> @@ -254,12 +267,6 @@ struct cfq_data {
>>         struct cfq_queue *active_queue;
>>         struct cfq_io_context *active_cic;
>>
>> -       /*
>> -        * async queue for each priority case
>> -        */
>> -       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>> -       struct cfq_queue *async_idle_cfqq;
>> -
>>         sector_t last_position;
>>
>>         /*
>> @@ -275,8 +282,6 @@ struct cfq_data {
>>         unsigned int cfq_latency;
>>         unsigned int cfq_group_isolation;
>>
>> -       unsigned int cic_index;
>> -       struct list_head cic_list;
>>
>>         /*
>>          * Fallback dummy cfqq for extreme OOM conditions
>> @@ -418,10 +423,16 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
>>   }
>>
>>   static void cfq_dispatch_insert(struct request_queue *, struct request *);
>> +static void __cfq_exit_single_io_context(struct cfq_data *,
>> +                                        struct cfq_group *,
>> +                                        struct cfq_io_context *);
>>   static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
>> -                                      struct io_context *, gfp_t);
>> +                                      struct cfq_io_context *, gfp_t);
>>   static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
>> -                                               struct io_context *);
>> +                                            struct cfq_group *,
>> +                                            struct io_context *);
>> +static void cfq_put_async_queues(struct cfq_group *);
>> +static int cfq_alloc_cic_index(void);
>>
>>   static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
>>                                             bool is_sync)
>> @@ -438,17 +449,28 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
>>   #define CIC_DEAD_KEY   1ul
>>   #define CIC_DEAD_INDEX_SHIFT   1
>>
>> -static inline void *cfqd_dead_key(struct cfq_data *cfqd)
>> +static inline void *cfqg_dead_key(struct cfq_group *cfqg)
>>   {
>> -       return (void *)(cfqd->cic_index<<  CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
>> +       return (void *)(cfqg->cic_index<<  CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
>> +}
>> +
>> +static inline struct cfq_group *cic_to_cfqg(struct cfq_io_context *cic)
>> +{
>> +       struct cfq_group *cfqg = cic->key;
>> +
>> +       if (unlikely((unsigned long) cfqg&  CIC_DEAD_KEY))
>> +               cfqg = NULL;
>> +
>> +       return cfqg;
>>   }
>>
>>   static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
>>   {
>> -       struct cfq_data *cfqd = cic->key;
>> +       struct cfq_data *cfqd = NULL;
>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>
>> -       if (unlikely((unsigned long) cfqd&  CIC_DEAD_KEY))
>> -               return NULL;
>> +       if (likely(cfqg))
>> +               cfqd =  cfqg->cfqd;
>>
>>         return cfqd;
>>   }
>> @@ -959,12 +981,12 @@ cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
>>   }
>>
>>   static struct cfq_group *
>> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
>> +                   int create)
>>   {
>> -       struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>         struct cfq_group *cfqg = NULL;
>>         void *key = cfqd;
>> -       int i, j;
>> +       int i, j, idx;
>>         struct cfq_rb_root *st;
>>         struct backing_dev_info *bdi =&cfqd->queue->backing_dev_info;
>>         unsigned int major, minor;
>> @@ -978,14 +1000,21 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>>         if (cfqg || !create)
>>                 goto done;
>>
>> +       idx = cfq_alloc_cic_index();
>> +       if (idx<  0)
>> +               goto done;
>> +
>>         cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>>         if (!cfqg)
>> -               goto done;
>> +               goto err;
>>
>>         for_each_cfqg_st(cfqg, i, j, st)
>>                 *st = CFQ_RB_ROOT;
>>         RB_CLEAR_NODE(&cfqg->rb_node);
>>
>> +       spin_lock_init(&cfqg->lock);
>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>> +
>>         /*
>>          * Take the initial reference that will be released on destroy
>>          * This can be thought of a joint reference by cgroup and
>> @@ -1002,7 +1031,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>>
>>         /* Add group on cfqd list */
>>         hlist_add_head(&cfqg->cfqd_node,&cfqd->cfqg_list);
>> +       cfqg->cfqd = cfqd;
>> +       cfqg->cic_index = idx;
>> +       goto done;
>>
>> +err:
>> +       spin_lock(&cic_index_lock);
>> +       ida_remove(&cic_index_ida, idx);
>> +       spin_unlock(&cic_index_lock);
>>   done:
>>         return cfqg;
>>   }
>> @@ -1095,10 +1131,6 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
>>
>>   static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
>>   {
>> -       /* Currently, all async queues are mapped to root group */
>> -       if (!cfq_cfqq_sync(cfqq))
>> -               cfqg =&cfqq->cfqd->root_group;
>> -
>>         cfqq->cfqg = cfqg;
>>         /* cfqq reference on cfqg */
>>         atomic_inc(&cfqq->cfqg->ref);
>> @@ -1114,6 +1146,16 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>>                 return;
>>         for_each_cfqg_st(cfqg, i, j, st)
>>                 BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
>> +
>> +       /**
>> +        * ToDo:
>> +        * root_group never reaches here and cic_index is never
>> +        * removed.  Unused cic_index lasts by elevator switching.
>> +        */
>> +       spin_lock(&cic_index_lock);
>> +       ida_remove(&cic_index_ida, cfqg->cic_index);
>> +       spin_unlock(&cic_index_lock);
>> +
>>         kfree(cfqg);
>>   }
>>
>> @@ -1122,6 +1164,15 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>         /* Something wrong if we are trying to remove same group twice */
>>         BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
>>
>> +       while (!list_empty(&cfqg->cic_list)) {
>> +               struct cfq_io_context *cic = list_entry(cfqg->cic_list.next,
>> +                                                       struct cfq_io_context,
>> +                                                       group_list);
>> +
>> +               __cfq_exit_single_io_context(cfqd, cfqg, cic);
>> +       }
>> +
>> +       cfq_put_async_queues(cfqg);
>>         hlist_del_init(&cfqg->cfqd_node);
>>
>>         /*
>> @@ -1497,10 +1548,12 @@ static struct request *
>>   cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
>>   {
>>         struct task_struct *tsk = current;
>> +       struct cfq_group *cfqg;
>>         struct cfq_io_context *cic;
>>         struct cfq_queue *cfqq;
>>
>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>         if (!cic)
>>                 return NULL;
>>
>> @@ -1611,6 +1664,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>>                            struct bio *bio)
>>   {
>>         struct cfq_data *cfqd = q->elevator->elevator_data;
>> +       struct cfq_group *cfqg;
>>         struct cfq_io_context *cic;
>>         struct cfq_queue *cfqq;
>>
>> @@ -1624,7 +1678,8 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>>          * Lookup the cfqq that this bio will be queued with. Allow
>>          * merge only if rq is queued there.
>>          */
>> -       cic = cfq_cic_lookup(cfqd, current->io_context);
>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>> +       cic = cfq_cic_lookup(cfqd, cfqg, current->io_context);
>>         if (!cic)
>>                 return false;
>>
>> @@ -2667,17 +2722,18 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>   }
>>
>>   static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>> +                                        struct cfq_group *cfqg,
>>                                          struct cfq_io_context *cic)
>>   {
>>         struct io_context *ioc = cic->ioc;
>>
>> -       list_del_init(&cic->queue_list);
>> +       list_del_init(&cic->group_list);
>>
>>         /*
>>          * Make sure dead mark is seen for dead queues
>>          */
>>         smp_wmb();
>> -       cic->key = cfqd_dead_key(cfqd);
>> +       cic->key = cfqg_dead_key(cfqg);
>>
>>         if (ioc->ioc_data == cic)
>>                 rcu_assign_pointer(ioc->ioc_data, NULL);
>> @@ -2696,23 +2752,23 @@ static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>>   static void cfq_exit_single_io_context(struct io_context *ioc,
>>                                        struct cfq_io_context *cic)
>>   {
>> -       struct cfq_data *cfqd = cic_to_cfqd(cic);
>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>
>> -       if (cfqd) {
>> -               struct request_queue *q = cfqd->queue;
>> +       if (cfqg) {
>> +               struct cfq_data *cfqd = cfqg->cfqd;
>>                 unsigned long flags;
>>
>> -               spin_lock_irqsave(q->queue_lock, flags);
>> +               spin_lock_irqsave(&cfqg->lock, flags);
>>
>>                 /*
>>                  * Ensure we get a fresh copy of the ->key to prevent
>>                  * race between exiting task and queue
>>                  */
>>                 smp_read_barrier_depends();
>> -               if (cic->key == cfqd)
>> -                       __cfq_exit_single_io_context(cfqd, cic);
>> +               if (cic->key == cfqg)
>> +                       __cfq_exit_single_io_context(cfqd, cfqg, cic);
>>
>> -               spin_unlock_irqrestore(q->queue_lock, flags);
>> +               spin_unlock_irqrestore(&cfqg->lock, flags);
>>         }
>>   }
>>
>> @@ -2734,7 +2790,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>                                                         cfqd->queue->node);
>>         if (cic) {
>>                 cic->last_end_request = jiffies;
>> -               INIT_LIST_HEAD(&cic->queue_list);
>> +               INIT_LIST_HEAD(&cic->group_list);
>>                 INIT_HLIST_NODE(&cic->cic_list);
>>                 cic->dtor = cfq_free_io_context;
>>                 cic->exit = cfq_exit_io_context;
>> @@ -2801,8 +2857,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
>>         cfqq = cic->cfqq[BLK_RW_ASYNC];
>>         if (cfqq) {
>>                 struct cfq_queue *new_cfqq;
>> -               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
>> -                                               GFP_ATOMIC);
>> +               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic, GFP_ATOMIC);
>>                 if (new_cfqq) {
>>                         cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
>>                         cfq_put_queue(cfqq);
>> @@ -2879,16 +2934,14 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
>>
>>   static struct cfq_queue *
>>   cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
>> -                    struct io_context *ioc, gfp_t gfp_mask)
>> +                    struct cfq_io_context *cic, gfp_t gfp_mask)
>>   {
>>         struct cfq_queue *cfqq, *new_cfqq = NULL;
>> -       struct cfq_io_context *cic;
>> +       struct io_context *ioc = cic->ioc;
>>         struct cfq_group *cfqg;
>>
>>   retry:
>> -       cfqg = cfq_get_cfqg(cfqd, NULL, 1);
>> -       cic = cfq_cic_lookup(cfqd, ioc);
>> -       /* cic always exists here */
>> +       cfqg = cic_to_cfqg(cic);
>>         cfqq = cic_to_cfqq(cic, is_sync);
>>
>>         /*
>> @@ -2930,36 +2983,38 @@ retry:
>>   }
>>
>>   static struct cfq_queue **
>> -cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
>> +cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
>>   {
>>         switch (ioprio_class) {
>>         case IOPRIO_CLASS_RT:
>> -               return&cfqd->async_cfqq[0][ioprio];
>> +               return&cfqg->async_cfqq[0][ioprio];
>>         case IOPRIO_CLASS_BE:
>> -               return&cfqd->async_cfqq[1][ioprio];
>> +               return&cfqg->async_cfqq[1][ioprio];
>>         case IOPRIO_CLASS_IDLE:
>> -               return&cfqd->async_idle_cfqq;
>> +               return&cfqg->async_idle_cfqq;
>>         default:
>>                 BUG();
>>         }
>>   }
>>
>>   static struct cfq_queue *
>> -cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
>> +cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_context *cic,
>>               gfp_t gfp_mask)
>>   {
>> +       struct io_context *ioc = cic->ioc;
>>         const int ioprio = task_ioprio(ioc);
>>         const int ioprio_class = task_ioprio_class(ioc);
>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>         struct cfq_queue **async_cfqq = NULL;
>>         struct cfq_queue *cfqq = NULL;
>>
>>         if (!is_sync) {
>> -               async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
>> +               async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
>>                 cfqq = *async_cfqq;
>>         }
>>
>>         if (!cfqq)
>> -               cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
>> +               cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, gfp_mask);
>>
>>         /*
>>          * pin the queue now that it's allocated, scheduler exit will prune it
>> @@ -2977,19 +3032,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
>>   * We drop cfq io contexts lazily, so we may find a dead one.
>>   */
>>   static void
>> -cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>> -                 struct cfq_io_context *cic)
>> +cfq_drop_dead_cic(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> +                 struct io_context *ioc, struct cfq_io_context *cic)
>>   {
>>         unsigned long flags;
>>
>> -       WARN_ON(!list_empty(&cic->queue_list));
>> -       BUG_ON(cic->key != cfqd_dead_key(cfqd));
>> +       WARN_ON(!list_empty(&cic->group_list));
>> +       BUG_ON(cic->key != cfqg_dead_key(cfqg));
>>
>>         spin_lock_irqsave(&ioc->lock, flags);
>>
>>         BUG_ON(ioc->ioc_data == cic);
>>
>> -       radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
>> +       radix_tree_delete(&ioc->radix_root, cfqg->cic_index);
>>         hlist_del_rcu(&cic->cic_list);
>>         spin_unlock_irqrestore(&ioc->lock, flags);
>>
>> @@ -2997,11 +3052,14 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>>   }
>>
>>   static struct cfq_io_context *
>> -cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>> +cfq_cic_lookup(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> +              struct io_context *ioc)
>>   {
>>         struct cfq_io_context *cic;
>>         unsigned long flags;
>>
>> +       if (!cfqg)
>> +               return NULL;
>>         if (unlikely(!ioc))
>>                 return NULL;
>>
>> @@ -3011,18 +3069,18 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>          * we maintain a last-hit cache, to avoid browsing over the tree
>>          */
>>         cic = rcu_dereference(ioc->ioc_data);
>> -       if (cic&&  cic->key == cfqd) {
>> +       if (cic&&  cic->key == cfqg) {
>>                 rcu_read_unlock();
>>                 return cic;
>>         }
>>
>>         do {
>> -               cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
>> +               cic = radix_tree_lookup(&ioc->radix_root, cfqg->cic_index);
>>                 rcu_read_unlock();
>>                 if (!cic)
>>                         break;
>> -               if (unlikely(cic->key != cfqd)) {
>> -                       cfq_drop_dead_cic(cfqd, ioc, cic);
>> +               if (unlikely(cic->key != cfqg)) {
>> +                       cfq_drop_dead_cic(cfqd, cfqg, ioc, cic);
>>                         rcu_read_lock();
>>                         continue;
>>                 }
>> @@ -3037,24 +3095,25 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>   }
>>
>>   /*
>> - * Add cic into ioc, using cfqd as the search key. This enables us to lookup
>> + * Add cic into ioc, using cfqg as the search key. This enables us to lookup
>>   * the process specific cfq io context when entered from the block layer.
>> - * Also adds the cic to a per-cfqd list, used when this queue is removed.
>> + * Also adds the cic to a per-cfqg list, used when the group is removed.
>> + * request_queue lock must be held.
>>   */
>> -static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>> -                       struct cfq_io_context *cic, gfp_t gfp_mask)
>> +static int cfq_cic_link(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> +                       struct io_context *ioc, struct cfq_io_context *cic)
>>   {
>>         unsigned long flags;
>>         int ret;
>>
>> -       ret = radix_tree_preload(gfp_mask);
>> +       ret = radix_tree_preload(GFP_ATOMIC);
>>         if (!ret) {
>>                 cic->ioc = ioc;
>> -               cic->key = cfqd;
>> +               cic->key = cfqg;
>>
>>                 spin_lock_irqsave(&ioc->lock, flags);
>>                 ret = radix_tree_insert(&ioc->radix_root,
>> -                                               cfqd->cic_index, cic);
>> +                                               cfqg->cic_index, cic);
>>                 if (!ret)
>>                         hlist_add_head_rcu(&cic->cic_list,&ioc->cic_list);
>>                 spin_unlock_irqrestore(&ioc->lock, flags);
>> @@ -3062,9 +3121,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>                 radix_tree_preload_end();
>>
>>                 if (!ret) {
>> -                       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>> -                       list_add(&cic->queue_list,&cfqd->cic_list);
>> -                       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>> +                       spin_lock_irqsave(&cfqg->lock, flags);
>> +                       list_add(&cic->group_list,&cfqg->cic_list);
>> +                       spin_unlock_irqrestore(&cfqg->lock, flags);
>>                 }
>>         }
>>
>> @@ -3080,10 +3139,12 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>   * than one device managed by cfq.
>>   */
>>   static struct cfq_io_context *
>> -cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>> +cfq_get_io_context(struct cfq_data *cfqd, struct bio *bio, gfp_t gfp_mask)
>>   {
>>         struct io_context *ioc = NULL;
>>         struct cfq_io_context *cic;
>> +       struct cfq_group *cfqg, *cfqg2;
>> +       unsigned long flags;
>>
>>         might_sleep_if(gfp_mask&  __GFP_WAIT);
>>
>> @@ -3091,18 +3152,38 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>         if (!ioc)
>>                 return NULL;
>>
>> -       cic = cfq_cic_lookup(cfqd, ioc);
>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>> +retry_cfqg:
>> +       cfqg = cfq_get_cfqg(cfqd, bio, 1);
>> +retry_cic:
>> +       cic = cfq_cic_lookup(cfqd, cfqg, ioc);
>>         if (cic)
>>                 goto out;
>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>
>>         cic = cfq_alloc_io_context(cfqd, gfp_mask);
>>         if (cic == NULL)
>>                 goto err;
>>
>> -       if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>> +
>> +       /* check the consistency breakage during unlock period */
>> +       cfqg2 = cfq_get_cfqg(cfqd, bio, 0);
>> +       if (cfqg != cfqg2) {
>> +               cfq_cic_free(cic);
>> +               if (!cfqg2)
>> +                       goto retry_cfqg;
>> +               else {
>> +                       cfqg = cfqg2;
>> +                       goto retry_cic;
>> +               }
>> +       }
>> +
>> +       if (cfq_cic_link(cfqd, cfqg, ioc, cic))
>>                 goto err_free;
>>
>>   out:
>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>         smp_read_barrier_depends();
>>         if (unlikely(ioc->ioprio_changed))
>>                 cfq_ioc_set_ioprio(ioc);
>> @@ -3113,6 +3194,7 @@ out:
>>   #endif
>>         return cic;
>>   err_free:
>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>         cfq_cic_free(cic);
>>   err:
>>         put_io_context(ioc);
>> @@ -3537,6 +3619,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
>>   static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>>   {
>>         struct cfq_data *cfqd = q->elevator->elevator_data;
>> +       struct cfq_group *cfqg;
>>         struct task_struct *tsk = current;
>>         struct cfq_io_context *cic;
>>         struct cfq_queue *cfqq;
>> @@ -3547,7 +3630,8 @@ static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>>          * so just lookup a possibly existing queue, or return 'may queue'
>>          * if that fails
>>          */
>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>         if (!cic)
>>                 return ELV_MQUEUE_MAY;
>>
>> @@ -3636,7 +3720,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>>
>>         might_sleep_if(gfp_mask&  __GFP_WAIT);
>>
>> -       cic = cfq_get_io_context(cfqd, gfp_mask);
>> +       cic = cfq_get_io_context(cfqd, bio, gfp_mask);
>>
>>         spin_lock_irqsave(q->queue_lock, flags);
>>
>> @@ -3646,7 +3730,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>>   new_queue:
>>         cfqq = cic_to_cfqq(cic, is_sync);
>>         if (!cfqq || cfqq ==&cfqd->oom_cfqq) {
>> -               cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
>> +               cfqq = cfq_get_queue(cfqd, is_sync, cic, gfp_mask);
>>                 cic_set_cfqq(cic, cfqq, is_sync);
>>         } else {
>>                 /*
>> @@ -3762,19 +3846,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
>>         cancel_work_sync(&cfqd->unplug_work);
>>   }
>>
>> -static void cfq_put_async_queues(struct cfq_data *cfqd)
>> +static void cfq_put_async_queues(struct cfq_group *cfqg)
>>   {
>>         int i;
>>
>>         for (i = 0; i<  IOPRIO_BE_NR; i++) {
>> -               if (cfqd->async_cfqq[0][i])
>> -                       cfq_put_queue(cfqd->async_cfqq[0][i]);
>> -               if (cfqd->async_cfqq[1][i])
>> -                       cfq_put_queue(cfqd->async_cfqq[1][i]);
>> +               if (cfqg->async_cfqq[0][i])
>> +                       cfq_put_queue(cfqg->async_cfqq[0][i]);
>> +               if (cfqg->async_cfqq[1][i])
>> +                       cfq_put_queue(cfqg->async_cfqq[1][i]);
>>         }
>>
>> -       if (cfqd->async_idle_cfqq)
>> -               cfq_put_queue(cfqd->async_idle_cfqq);
>> +       if (cfqg->async_idle_cfqq)
>> +               cfq_put_queue(cfqg->async_idle_cfqq);
>>   }
>>
>>   static void cfq_cfqd_free(struct rcu_head *head)
>> @@ -3794,15 +3878,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>>         if (cfqd->active_queue)
>>                 __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
>>
>> -       while (!list_empty(&cfqd->cic_list)) {
>> -               struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
>> -                                                       struct cfq_io_context,
>> -                                                       queue_list);
>> -
>> -               __cfq_exit_single_io_context(cfqd, cic);
>> -       }
>> -
>> -       cfq_put_async_queues(cfqd);
>>         cfq_release_cfq_groups(cfqd);
>>         cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
>>
>> @@ -3810,10 +3885,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>>
>>         cfq_shutdown_timer_wq(cfqd);
>>
>> -       spin_lock(&cic_index_lock);
>> -       ida_remove(&cic_index_ida, cfqd->cic_index);
>> -       spin_unlock(&cic_index_lock);
>> -
>>         /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
>>         call_rcu(&cfqd->rcu, cfq_cfqd_free);
>>   }
>> @@ -3823,7 +3894,7 @@ static int cfq_alloc_cic_index(void)
>>         int index, error;
>>
>>         do {
>> -               if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
>> +               if (!ida_pre_get(&cic_index_ida, GFP_ATOMIC))
>>                         return -ENOMEM;
>>
>>                 spin_lock(&cic_index_lock);
>> @@ -3839,20 +3910,18 @@ static int cfq_alloc_cic_index(void)
>>   static void *cfq_init_queue(struct request_queue *q)
>>   {
>>         struct cfq_data *cfqd;
>> -       int i, j;
>> +       int i, j, idx;
>>         struct cfq_group *cfqg;
>>         struct cfq_rb_root *st;
>>
>> -       i = cfq_alloc_cic_index();
>> -       if (i<  0)
>> +       idx = cfq_alloc_cic_index();
>> +       if (idx<  0)
>>                 return NULL;
>>
>>         cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
>>         if (!cfqd)
>>                 return NULL;
>>
>> -       cfqd->cic_index = i;
>> -
>>         /* Init root service tree */
>>         cfqd->grp_service_tree = CFQ_RB_ROOT;
>>
>> @@ -3861,6 +3930,9 @@ static void *cfq_init_queue(struct request_queue *q)
>>         for_each_cfqg_st(cfqg, i, j, st)
>>                 *st = CFQ_RB_ROOT;
>>         RB_CLEAR_NODE(&cfqg->rb_node);
>> +       cfqg->cfqd = cfqd;
>> +       cfqg->cic_index = idx;
>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>>
>>         /* Give preference to root group over other groups */
>>         cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
>> @@ -3874,6 +3946,7 @@ static void *cfq_init_queue(struct request_queue *q)
>>         rcu_read_lock();
>>         cfq_blkiocg_add_blkio_group(&blkio_root_cgroup,&cfqg->blkg,
>>                                         (void *)cfqd, 0);
>> +       hlist_add_head(&cfqg->cfqd_node,&cfqd->cfqg_list);
>>         rcu_read_unlock();
>>   #endif
>>         /*
>> @@ -3893,8 +3966,6 @@ static void *cfq_init_queue(struct request_queue *q)
>>         atomic_inc(&cfqd->oom_cfqq.ref);
>>         cfq_link_cfqq_cfqg(&cfqd->oom_cfqq,&cfqd->root_group);
>>
>> -       INIT_LIST_HEAD(&cfqd->cic_list);
>> -
>>         cfqd->queue = q;
>>
>>         init_timer(&cfqd->idle_slice_timer);
>> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
>> index 64d5291..6c05b54 100644
>> --- a/include/linux/iocontext.h
>> +++ b/include/linux/iocontext.h
>> @@ -18,7 +18,7 @@ struct cfq_io_context {
>>         unsigned long ttime_samples;
>>         unsigned long ttime_mean;
>>
>> -       struct list_head queue_list;
>> +       struct list_head group_list;
>>         struct hlist_node cic_list;
>>
>>         void (*dtor)(struct io_context *); /* destructor */
>> --
>> 1.6.2.5
>>
>

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group
  2010-08-13 21:00     ` Munehiro Ikeda
@ 2010-08-13 23:01       ` Nauman Rafique
  2010-08-14  0:49         ` Munehiro Ikeda
  0 siblings, 1 reply; 53+ messages in thread
From: Nauman Rafique @ 2010-08-13 23:01 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

On Fri, Aug 13, 2010 at 2:00 PM, Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> Nauman Rafique wrote, on 08/12/2010 09:24 PM:
>>
>> Muuhh,
>> I do not understand the motivation behind making cfq_io_context per
>> cgroup. We can extract cgroup id from bio, use that to lookup cgroup
>> and its async queues. What am I missing?
>
> The cgroup ID tells us only which cgroup the thread belongs to, not
> the thread itself.  This means that IO priority and IO prio-class
> can't be determined from the cgroup ID alone.

One way to do it would be to get the ioprio and class from the context
that is used to submit the async request.  IO priority and class are
tied to a thread anyway, and the io context of that thread can be used
to retrieve those values.  Even if a thread is submitting IOs to
different cgroups, I don't see how you could apply a different IO
priority and class to its async IOs for different cgroups.  Please let
me know if this does not make sense.
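
Just to illustrate (this is only a sketch on my side, not code from your
patch; I'm assuming the helpers shown in the quoted patch, namely
cfq_get_cfqg(), cfq_async_queue_prio(), task_ioprio() and
task_ioprio_class(), keep the signatures shown there):

	/*
	 * Sketch only: the group is chosen from the bio's cgroup, while
	 * the class/prio used to index the per-group async array comes
	 * from the io_context of the submitting task.
	 */
	static struct cfq_queue *
	lookup_async_cfqq(struct cfq_data *cfqd, struct bio *bio,
			  struct io_context *ioc)
	{
		struct cfq_group *cfqg = cfq_get_cfqg(cfqd, bio, 0);
		struct cfq_queue **async_cfqq;

		if (!cfqg || !ioc)
			return NULL;

		async_cfqq = cfq_async_queue_prio(cfqg, task_ioprio_class(ioc),
						  task_ioprio(ioc));
		return *async_cfqq;	/* may still be NULL */
	}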

> The pointers to the async queues are held in cfqg->async_cfqq[class][prio].
> Once IO priority is taken into account, it is impossible to tell which
> queue is appropriate from the cgroup ID alone.
>
> Frankly speaking, I'm not 100% confident that IO priority should be
> applied to async write IOs, but anyway, I made up my mind to keep the
> current scheme.
>
> Do I make sense?  If you have any other idea, I am glad to hear it.
>
>
> Thanks,
> Muuhh
>
>
>>
>> On Thu, Jul 8, 2010 at 8:22 PM, Munehiro Ikeda<m-ikeda@ds.jp.nec.com>
>>  wrote:
>>>
>>> This is the main body for async capability of blkio controller.
>>> The basic ideas are
>>>  - To move async queues from cfq_data to cfq_group, and
>>>  - To associate cfq_io_context with cfq_group
>>>
>>> Each cfq_io_context, which was created per an io_context
>>> per a device, is now created per an io_context per a cfq_group.
>>> Each cfq_group is created per a cgroup per a device, so
>>> cfq_io_context is now resulted in to be created per
>>> an io_context per a cgroup per a device.
>>>
>>> To protect link between cfq_io_context and cfq_group,
>>> locking code is modified in several parts.
>>>
>>> ToDo:
>>> - Lock protection still might be imperfect and more investigation
>>>  is needed.
>>>
>>> - cic_index of root cfq_group (cfqd->root_group.cic_index) is not
>>>  removed and lasts beyond elevator switching.
>>>  This issues should be solved.
>>>
>>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>>> ---
>>>  block/cfq-iosched.c       |  277
>>> ++++++++++++++++++++++++++++-----------------
>>>  include/linux/iocontext.h |    2 +-
>>>  2 files changed, 175 insertions(+), 104 deletions(-)
>>>
>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> index 68f76e9..4186c30 100644
>>> --- a/block/cfq-iosched.c
>>> +++ b/block/cfq-iosched.c
>>> @@ -191,10 +191,23 @@ struct cfq_group {
>>>        struct cfq_rb_root service_trees[2][3];
>>>        struct cfq_rb_root service_tree_idle;
>>>
>>> +       /*
>>> +        * async queue for each priority case
>>> +        */
>>> +       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>>> +       struct cfq_queue *async_idle_cfqq;
>>> +
>>>        unsigned long saved_workload_slice;
>>>        enum wl_type_t saved_workload;
>>>        enum wl_prio_t saved_serving_prio;
>>>        struct blkio_group blkg;
>>> +       struct cfq_data *cfqd;
>>> +
>>> +       /* lock for cic_list */
>>> +       spinlock_t lock;
>>> +       unsigned int cic_index;
>>> +       struct list_head cic_list;
>>> +
>>>  #ifdef CONFIG_CFQ_GROUP_IOSCHED
>>>        struct hlist_node cfqd_node;
>>>        atomic_t ref;
>>> @@ -254,12 +267,6 @@ struct cfq_data {
>>>        struct cfq_queue *active_queue;
>>>        struct cfq_io_context *active_cic;
>>>
>>> -       /*
>>> -        * async queue for each priority case
>>> -        */
>>> -       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>>> -       struct cfq_queue *async_idle_cfqq;
>>> -
>>>        sector_t last_position;
>>>
>>>        /*
>>> @@ -275,8 +282,6 @@ struct cfq_data {
>>>        unsigned int cfq_latency;
>>>        unsigned int cfq_group_isolation;
>>>
>>> -       unsigned int cic_index;
>>> -       struct list_head cic_list;
>>>
>>>        /*
>>>         * Fallback dummy cfqq for extreme OOM conditions
>>> @@ -418,10 +423,16 @@ static inline int cfqg_busy_async_queues(struct
>>> cfq_data *cfqd,
>>>  }
>>>
>>>  static void cfq_dispatch_insert(struct request_queue *, struct request
>>> *);
>>> +static void __cfq_exit_single_io_context(struct cfq_data *,
>>> +                                        struct cfq_group *,
>>> +                                        struct cfq_io_context *);
>>>  static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
>>> -                                      struct io_context *, gfp_t);
>>> +                                      struct cfq_io_context *, gfp_t);
>>>  static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
>>> -                                               struct io_context *);
>>> +                                            struct cfq_group *,
>>> +                                            struct io_context *);
>>> +static void cfq_put_async_queues(struct cfq_group *);
>>> +static int cfq_alloc_cic_index(void);
>>>
>>>  static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
>>>                                            bool is_sync)
>>> @@ -438,17 +449,28 @@ static inline void cic_set_cfqq(struct
>>> cfq_io_context *cic,
>>>  #define CIC_DEAD_KEY   1ul
>>>  #define CIC_DEAD_INDEX_SHIFT   1
>>>
>>> -static inline void *cfqd_dead_key(struct cfq_data *cfqd)
>>> +static inline void *cfqg_dead_key(struct cfq_group *cfqg)
>>>  {
>>> -       return (void *)(cfqd->cic_index<<  CIC_DEAD_INDEX_SHIFT |
>>> CIC_DEAD_KEY);
>>> +       return (void *)(cfqg->cic_index<<  CIC_DEAD_INDEX_SHIFT |
>>> CIC_DEAD_KEY);
>>> +}
>>> +
>>> +static inline struct cfq_group *cic_to_cfqg(struct cfq_io_context *cic)
>>> +{
>>> +       struct cfq_group *cfqg = cic->key;
>>> +
>>> +       if (unlikely((unsigned long) cfqg&  CIC_DEAD_KEY))
>>> +               cfqg = NULL;
>>> +
>>> +       return cfqg;
>>>  }
>>>
>>>  static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
>>>  {
>>> -       struct cfq_data *cfqd = cic->key;
>>> +       struct cfq_data *cfqd = NULL;
>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>
>>> -       if (unlikely((unsigned long) cfqd&  CIC_DEAD_KEY))
>>> -               return NULL;
>>> +       if (likely(cfqg))
>>> +               cfqd =  cfqg->cfqd;
>>>
>>>        return cfqd;
>>>  }
>>> @@ -959,12 +981,12 @@ cfq_update_blkio_group_weight(struct blkio_group
>>> *blkg, unsigned int weight)
>>>  }
>>>
>>>  static struct cfq_group *
>>> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int
>>> create)
>>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
>>> +                   int create)
>>>  {
>>> -       struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>        struct cfq_group *cfqg = NULL;
>>>        void *key = cfqd;
>>> -       int i, j;
>>> +       int i, j, idx;
>>>        struct cfq_rb_root *st;
>>>        struct backing_dev_info *bdi =&cfqd->queue->backing_dev_info;
>>>        unsigned int major, minor;
>>> @@ -978,14 +1000,21 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct
>>> cgroup *cgroup, int create)
>>>        if (cfqg || !create)
>>>                goto done;
>>>
>>> +       idx = cfq_alloc_cic_index();
>>> +       if (idx<  0)
>>> +               goto done;
>>> +
>>>        cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>>>        if (!cfqg)
>>> -               goto done;
>>> +               goto err;
>>>
>>>        for_each_cfqg_st(cfqg, i, j, st)
>>>                *st = CFQ_RB_ROOT;
>>>        RB_CLEAR_NODE(&cfqg->rb_node);
>>>
>>> +       spin_lock_init(&cfqg->lock);
>>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>>> +
>>>        /*
>>>         * Take the initial reference that will be released on destroy
>>>         * This can be thought of a joint reference by cgroup and
>>> @@ -1002,7 +1031,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct
>>> cgroup *cgroup, int create)
>>>
>>>        /* Add group on cfqd list */
>>>        hlist_add_head(&cfqg->cfqd_node,&cfqd->cfqg_list);
>>> +       cfqg->cfqd = cfqd;
>>> +       cfqg->cic_index = idx;
>>> +       goto done;
>>>
>>> +err:
>>> +       spin_lock(&cic_index_lock);
>>> +       ida_remove(&cic_index_ida, idx);
>>> +       spin_unlock(&cic_index_lock);
>>>  done:
>>>        return cfqg;
>>>  }
>>> @@ -1095,10 +1131,6 @@ static inline struct cfq_group
>>> *cfq_ref_get_cfqg(struct cfq_group *cfqg)
>>>
>>>  static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group
>>> *cfqg)
>>>  {
>>> -       /* Currently, all async queues are mapped to root group */
>>> -       if (!cfq_cfqq_sync(cfqq))
>>> -               cfqg =&cfqq->cfqd->root_group;
>>> -
>>>        cfqq->cfqg = cfqg;
>>>        /* cfqq reference on cfqg */
>>>        atomic_inc(&cfqq->cfqg->ref);
>>> @@ -1114,6 +1146,16 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>>>                return;
>>>        for_each_cfqg_st(cfqg, i, j, st)
>>>                BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
>>> +
>>> +       /**
>>> +        * ToDo:
>>> +        * root_group never reaches here and cic_index is never
>>> +        * removed.  Unused cic_index lasts by elevator switching.
>>> +        */
>>> +       spin_lock(&cic_index_lock);
>>> +       ida_remove(&cic_index_ida, cfqg->cic_index);
>>> +       spin_unlock(&cic_index_lock);
>>> +
>>>        kfree(cfqg);
>>>  }
>>>
>>> @@ -1122,6 +1164,15 @@ static void cfq_destroy_cfqg(struct cfq_data
>>> *cfqd, struct cfq_group *cfqg)
>>>        /* Something wrong if we are trying to remove same group twice */
>>>        BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
>>>
>>> +       while (!list_empty(&cfqg->cic_list)) {
>>> +               struct cfq_io_context *cic =
>>> list_entry(cfqg->cic_list.next,
>>> +                                                       struct
>>> cfq_io_context,
>>> +                                                       group_list);
>>> +
>>> +               __cfq_exit_single_io_context(cfqd, cfqg, cic);
>>> +       }
>>> +
>>> +       cfq_put_async_queues(cfqg);
>>>        hlist_del_init(&cfqg->cfqd_node);
>>>
>>>        /*
>>> @@ -1497,10 +1548,12 @@ static struct request *
>>>  cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
>>>  {
>>>        struct task_struct *tsk = current;
>>> +       struct cfq_group *cfqg;
>>>        struct cfq_io_context *cic;
>>>        struct cfq_queue *cfqq;
>>>
>>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>>        if (!cic)
>>>                return NULL;
>>>
>>> @@ -1611,6 +1664,7 @@ static int cfq_allow_merge(struct request_queue *q,
>>> struct request *rq,
>>>                           struct bio *bio)
>>>  {
>>>        struct cfq_data *cfqd = q->elevator->elevator_data;
>>> +       struct cfq_group *cfqg;
>>>        struct cfq_io_context *cic;
>>>        struct cfq_queue *cfqq;
>>>
>>> @@ -1624,7 +1678,8 @@ static int cfq_allow_merge(struct request_queue *q,
>>> struct request *rq,
>>>         * Lookup the cfqq that this bio will be queued with. Allow
>>>         * merge only if rq is queued there.
>>>         */
>>> -       cic = cfq_cic_lookup(cfqd, current->io_context);
>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>> +       cic = cfq_cic_lookup(cfqd, cfqg, current->io_context);
>>>        if (!cic)
>>>                return false;
>>>
>>> @@ -2667,17 +2722,18 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd,
>>> struct cfq_queue *cfqq)
>>>  }
>>>
>>>  static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>>> +                                        struct cfq_group *cfqg,
>>>                                         struct cfq_io_context *cic)
>>>  {
>>>        struct io_context *ioc = cic->ioc;
>>>
>>> -       list_del_init(&cic->queue_list);
>>> +       list_del_init(&cic->group_list);
>>>
>>>        /*
>>>         * Make sure dead mark is seen for dead queues
>>>         */
>>>        smp_wmb();
>>> -       cic->key = cfqd_dead_key(cfqd);
>>> +       cic->key = cfqg_dead_key(cfqg);
>>>
>>>        if (ioc->ioc_data == cic)
>>>                rcu_assign_pointer(ioc->ioc_data, NULL);
>>> @@ -2696,23 +2752,23 @@ static void __cfq_exit_single_io_context(struct
>>> cfq_data *cfqd,
>>>  static void cfq_exit_single_io_context(struct io_context *ioc,
>>>                                       struct cfq_io_context *cic)
>>>  {
>>> -       struct cfq_data *cfqd = cic_to_cfqd(cic);
>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>
>>> -       if (cfqd) {
>>> -               struct request_queue *q = cfqd->queue;
>>> +       if (cfqg) {
>>> +               struct cfq_data *cfqd = cfqg->cfqd;
>>>                unsigned long flags;
>>>
>>> -               spin_lock_irqsave(q->queue_lock, flags);
>>> +               spin_lock_irqsave(&cfqg->lock, flags);
>>>
>>>                /*
>>>                 * Ensure we get a fresh copy of the ->key to prevent
>>>                 * race between exiting task and queue
>>>                 */
>>>                smp_read_barrier_depends();
>>> -               if (cic->key == cfqd)
>>> -                       __cfq_exit_single_io_context(cfqd, cic);
>>> +               if (cic->key == cfqg)
>>> +                       __cfq_exit_single_io_context(cfqd, cfqg, cic);
>>>
>>> -               spin_unlock_irqrestore(q->queue_lock, flags);
>>> +               spin_unlock_irqrestore(&cfqg->lock, flags);
>>>        }
>>>  }
>>>
>>> @@ -2734,7 +2790,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t
>>> gfp_mask)
>>>
>>>  cfqd->queue->node);
>>>        if (cic) {
>>>                cic->last_end_request = jiffies;
>>> -               INIT_LIST_HEAD(&cic->queue_list);
>>> +               INIT_LIST_HEAD(&cic->group_list);
>>>                INIT_HLIST_NODE(&cic->cic_list);
>>>                cic->dtor = cfq_free_io_context;
>>>                cic->exit = cfq_exit_io_context;
>>> @@ -2801,8 +2857,7 @@ static void changed_ioprio(struct io_context *ioc,
>>> struct cfq_io_context *cic)
>>>        cfqq = cic->cfqq[BLK_RW_ASYNC];
>>>        if (cfqq) {
>>>                struct cfq_queue *new_cfqq;
>>> -               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
>>> -                                               GFP_ATOMIC);
>>> +               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic,
>>> GFP_ATOMIC);
>>>                if (new_cfqq) {
>>>                        cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
>>>                        cfq_put_queue(cfqq);
>>> @@ -2879,16 +2934,14 @@ static void cfq_ioc_set_cgroup(struct io_context
>>> *ioc)
>>>
>>>  static struct cfq_queue *
>>>  cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
>>> -                    struct io_context *ioc, gfp_t gfp_mask)
>>> +                    struct cfq_io_context *cic, gfp_t gfp_mask)
>>>  {
>>>        struct cfq_queue *cfqq, *new_cfqq = NULL;
>>> -       struct cfq_io_context *cic;
>>> +       struct io_context *ioc = cic->ioc;
>>>        struct cfq_group *cfqg;
>>>
>>>  retry:
>>> -       cfqg = cfq_get_cfqg(cfqd, NULL, 1);
>>> -       cic = cfq_cic_lookup(cfqd, ioc);
>>> -       /* cic always exists here */
>>> +       cfqg = cic_to_cfqg(cic);
>>>        cfqq = cic_to_cfqq(cic, is_sync);
>>>
>>>        /*
>>> @@ -2930,36 +2983,38 @@ retry:
>>>  }
>>>
>>>  static struct cfq_queue **
>>> -cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int
>>> ioprio)
>>> +cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int
>>> ioprio)
>>>  {
>>>        switch (ioprio_class) {
>>>        case IOPRIO_CLASS_RT:
>>> -               return&cfqd->async_cfqq[0][ioprio];
>>> +               return&cfqg->async_cfqq[0][ioprio];
>>>        case IOPRIO_CLASS_BE:
>>> -               return&cfqd->async_cfqq[1][ioprio];
>>> +               return&cfqg->async_cfqq[1][ioprio];
>>>        case IOPRIO_CLASS_IDLE:
>>> -               return&cfqd->async_idle_cfqq;
>>> +               return&cfqg->async_idle_cfqq;
>>>        default:
>>>                BUG();
>>>        }
>>>  }
>>>
>>>  static struct cfq_queue *
>>> -cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context
>>> *ioc,
>>> +cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_context
>>> *cic,
>>>              gfp_t gfp_mask)
>>>  {
>>> +       struct io_context *ioc = cic->ioc;
>>>        const int ioprio = task_ioprio(ioc);
>>>        const int ioprio_class = task_ioprio_class(ioc);
>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>        struct cfq_queue **async_cfqq = NULL;
>>>        struct cfq_queue *cfqq = NULL;
>>>
>>>        if (!is_sync) {
>>> -               async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class,
>>> ioprio);
>>> +               async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
>>> ioprio);
>>>                cfqq = *async_cfqq;
>>>        }
>>>
>>>        if (!cfqq)
>>> -               cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc,
>>> gfp_mask);
>>> +               cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic,
>>> gfp_mask);
>>>
>>>        /*
>>>         * pin the queue now that it's allocated, scheduler exit will
>>> prune it
>>> @@ -2977,19 +3032,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool
>>> is_sync, struct io_context *ioc,
>>>  * We drop cfq io contexts lazily, so we may find a dead one.
>>>  */
>>>  static void
>>> -cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>>> -                 struct cfq_io_context *cic)
>>> +cfq_drop_dead_cic(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>> +                 struct io_context *ioc, struct cfq_io_context *cic)
>>>  {
>>>        unsigned long flags;
>>>
>>> -       WARN_ON(!list_empty(&cic->queue_list));
>>> -       BUG_ON(cic->key != cfqd_dead_key(cfqd));
>>> +       WARN_ON(!list_empty(&cic->group_list));
>>> +       BUG_ON(cic->key != cfqg_dead_key(cfqg));
>>>
>>>        spin_lock_irqsave(&ioc->lock, flags);
>>>
>>>        BUG_ON(ioc->ioc_data == cic);
>>>
>>> -       radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
>>> +       radix_tree_delete(&ioc->radix_root, cfqg->cic_index);
>>>        hlist_del_rcu(&cic->cic_list);
>>>        spin_unlock_irqrestore(&ioc->lock, flags);
>>>
>>> @@ -2997,11 +3052,14 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct
>>> io_context *ioc,
>>>  }
>>>
>>>  static struct cfq_io_context *
>>> -cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>> +cfq_cic_lookup(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>> +              struct io_context *ioc)
>>>  {
>>>        struct cfq_io_context *cic;
>>>        unsigned long flags;
>>>
>>> +       if (!cfqg)
>>> +               return NULL;
>>>        if (unlikely(!ioc))
>>>                return NULL;
>>>
>>> @@ -3011,18 +3069,18 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct
>>> io_context *ioc)
>>>         * we maintain a last-hit cache, to avoid browsing over the tree
>>>         */
>>>        cic = rcu_dereference(ioc->ioc_data);
>>> -       if (cic&&  cic->key == cfqd) {
>>> +       if (cic&&  cic->key == cfqg) {
>>>                rcu_read_unlock();
>>>                return cic;
>>>        }
>>>
>>>        do {
>>> -               cic = radix_tree_lookup(&ioc->radix_root,
>>> cfqd->cic_index);
>>> +               cic = radix_tree_lookup(&ioc->radix_root,
>>> cfqg->cic_index);
>>>                rcu_read_unlock();
>>>                if (!cic)
>>>                        break;
>>> -               if (unlikely(cic->key != cfqd)) {
>>> -                       cfq_drop_dead_cic(cfqd, ioc, cic);
>>> +               if (unlikely(cic->key != cfqg)) {
>>> +                       cfq_drop_dead_cic(cfqd, cfqg, ioc, cic);
>>>                        rcu_read_lock();
>>>                        continue;
>>>                }
>>> @@ -3037,24 +3095,25 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct
>>> io_context *ioc)
>>>  }
>>>
>>>  /*
>>> - * Add cic into ioc, using cfqd as the search key. This enables us to
>>> lookup
>>> + * Add cic into ioc, using cfqg as the search key. This enables us to
>>> lookup
>>>  * the process specific cfq io context when entered from the block layer.
>>> - * Also adds the cic to a per-cfqd list, used when this queue is
>>> removed.
>>> + * Also adds the cic to a per-cfqg list, used when the group is removed.
>>> + * request_queue lock must be held.
>>>  */
>>> -static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>> -                       struct cfq_io_context *cic, gfp_t gfp_mask)
>>> +static int cfq_cic_link(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>> +                       struct io_context *ioc, struct cfq_io_context
>>> *cic)
>>>  {
>>>        unsigned long flags;
>>>        int ret;
>>>
>>> -       ret = radix_tree_preload(gfp_mask);
>>> +       ret = radix_tree_preload(GFP_ATOMIC);
>>>        if (!ret) {
>>>                cic->ioc = ioc;
>>> -               cic->key = cfqd;
>>> +               cic->key = cfqg;
>>>
>>>                spin_lock_irqsave(&ioc->lock, flags);
>>>                ret = radix_tree_insert(&ioc->radix_root,
>>> -                                               cfqd->cic_index, cic);
>>> +                                               cfqg->cic_index, cic);
>>>                if (!ret)
>>>                        hlist_add_head_rcu(&cic->cic_list,&ioc->cic_list);
>>>                spin_unlock_irqrestore(&ioc->lock, flags);
>>> @@ -3062,9 +3121,9 @@ static int cfq_cic_link(struct cfq_data *cfqd,
>>> struct io_context *ioc,
>>>                radix_tree_preload_end();
>>>
>>>                if (!ret) {
>>> -                       spin_lock_irqsave(cfqd->queue->queue_lock,
>>> flags);
>>> -                       list_add(&cic->queue_list,&cfqd->cic_list);
>>> -                       spin_unlock_irqrestore(cfqd->queue->queue_lock,
>>> flags);
>>> +                       spin_lock_irqsave(&cfqg->lock, flags);
>>> +                       list_add(&cic->group_list,&cfqg->cic_list);
>>> +                       spin_unlock_irqrestore(&cfqg->lock, flags);
>>>                }
>>>        }
>>>
>>> @@ -3080,10 +3139,12 @@ static int cfq_cic_link(struct cfq_data *cfqd,
>>> struct io_context *ioc,
>>>  * than one device managed by cfq.
>>>  */
>>>  static struct cfq_io_context *
>>> -cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>> +cfq_get_io_context(struct cfq_data *cfqd, struct bio *bio, gfp_t
>>> gfp_mask)
>>>  {
>>>        struct io_context *ioc = NULL;
>>>        struct cfq_io_context *cic;
>>> +       struct cfq_group *cfqg, *cfqg2;
>>> +       unsigned long flags;
>>>
>>>        might_sleep_if(gfp_mask&  __GFP_WAIT);
>>>
>>> @@ -3091,18 +3152,38 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t
>>> gfp_mask)
>>>        if (!ioc)
>>>                return NULL;
>>>
>>> -       cic = cfq_cic_lookup(cfqd, ioc);
>>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>>> +retry_cfqg:
>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 1);
>>> +retry_cic:
>>> +       cic = cfq_cic_lookup(cfqd, cfqg, ioc);
>>>        if (cic)
>>>                goto out;
>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>
>>>        cic = cfq_alloc_io_context(cfqd, gfp_mask);
>>>        if (cic == NULL)
>>>                goto err;
>>>
>>> -       if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
>>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>>> +
>>> +       /* check the consistency breakage during unlock period */
>>> +       cfqg2 = cfq_get_cfqg(cfqd, bio, 0);
>>> +       if (cfqg != cfqg2) {
>>> +               cfq_cic_free(cic);
>>> +               if (!cfqg2)
>>> +                       goto retry_cfqg;
>>> +               else {
>>> +                       cfqg = cfqg2;
>>> +                       goto retry_cic;
>>> +               }
>>> +       }
>>> +
>>> +       if (cfq_cic_link(cfqd, cfqg, ioc, cic))
>>>                goto err_free;
>>>
>>>  out:
>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>        smp_read_barrier_depends();
>>>        if (unlikely(ioc->ioprio_changed))
>>>                cfq_ioc_set_ioprio(ioc);
>>> @@ -3113,6 +3194,7 @@ out:
>>>  #endif
>>>        return cic;
>>>  err_free:
>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>        cfq_cic_free(cic);
>>>  err:
>>>        put_io_context(ioc);
>>> @@ -3537,6 +3619,7 @@ static inline int __cfq_may_queue(struct cfq_queue
>>> *cfqq)
>>>  static int cfq_may_queue(struct request_queue *q, struct bio *bio, int
>>> rw)
>>>  {
>>>        struct cfq_data *cfqd = q->elevator->elevator_data;
>>> +       struct cfq_group *cfqg;
>>>        struct task_struct *tsk = current;
>>>        struct cfq_io_context *cic;
>>>        struct cfq_queue *cfqq;
>>> @@ -3547,7 +3630,8 @@ static int cfq_may_queue(struct request_queue *q,
>>> struct bio *bio, int rw)
>>>         * so just lookup a possibly existing queue, or return 'may queue'
>>>         * if that fails
>>>         */
>>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>>        if (!cic)
>>>                return ELV_MQUEUE_MAY;
>>>
>>> @@ -3636,7 +3720,7 @@ cfq_set_request(struct request_queue *q, struct
>>> request *rq, struct bio *bio,
>>>
>>>        might_sleep_if(gfp_mask&  __GFP_WAIT);
>>>
>>> -       cic = cfq_get_io_context(cfqd, gfp_mask);
>>> +       cic = cfq_get_io_context(cfqd, bio, gfp_mask);
>>>
>>>        spin_lock_irqsave(q->queue_lock, flags);
>>>
>>> @@ -3646,7 +3730,7 @@ cfq_set_request(struct request_queue *q, struct
>>> request *rq, struct bio *bio,
>>>  new_queue:
>>>        cfqq = cic_to_cfqq(cic, is_sync);
>>>        if (!cfqq || cfqq ==&cfqd->oom_cfqq) {
>>> -               cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
>>> +               cfqq = cfq_get_queue(cfqd, is_sync, cic, gfp_mask);
>>>                cic_set_cfqq(cic, cfqq, is_sync);
>>>        } else {
>>>                /*
>>> @@ -3762,19 +3846,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data
>>> *cfqd)
>>>        cancel_work_sync(&cfqd->unplug_work);
>>>  }
>>>
>>> -static void cfq_put_async_queues(struct cfq_data *cfqd)
>>> +static void cfq_put_async_queues(struct cfq_group *cfqg)
>>>  {
>>>        int i;
>>>
>>>        for (i = 0; i<  IOPRIO_BE_NR; i++) {
>>> -               if (cfqd->async_cfqq[0][i])
>>> -                       cfq_put_queue(cfqd->async_cfqq[0][i]);
>>> -               if (cfqd->async_cfqq[1][i])
>>> -                       cfq_put_queue(cfqd->async_cfqq[1][i]);
>>> +               if (cfqg->async_cfqq[0][i])
>>> +                       cfq_put_queue(cfqg->async_cfqq[0][i]);
>>> +               if (cfqg->async_cfqq[1][i])
>>> +                       cfq_put_queue(cfqg->async_cfqq[1][i]);
>>>        }
>>>
>>> -       if (cfqd->async_idle_cfqq)
>>> -               cfq_put_queue(cfqd->async_idle_cfqq);
>>> +       if (cfqg->async_idle_cfqq)
>>> +               cfq_put_queue(cfqg->async_idle_cfqq);
>>>  }
>>>
>>>  static void cfq_cfqd_free(struct rcu_head *head)
>>> @@ -3794,15 +3878,6 @@ static void cfq_exit_queue(struct elevator_queue
>>> *e)
>>>        if (cfqd->active_queue)
>>>                __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
>>>
>>> -       while (!list_empty(&cfqd->cic_list)) {
>>> -               struct cfq_io_context *cic =
>>> list_entry(cfqd->cic_list.next,
>>> -                                                       struct
>>> cfq_io_context,
>>> -                                                       queue_list);
>>> -
>>> -               __cfq_exit_single_io_context(cfqd, cic);
>>> -       }
>>> -
>>> -       cfq_put_async_queues(cfqd);
>>>        cfq_release_cfq_groups(cfqd);
>>>        cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
>>>
>>> @@ -3810,10 +3885,6 @@ static void cfq_exit_queue(struct elevator_queue
>>> *e)
>>>
>>>        cfq_shutdown_timer_wq(cfqd);
>>>
>>> -       spin_lock(&cic_index_lock);
>>> -       ida_remove(&cic_index_ida, cfqd->cic_index);
>>> -       spin_unlock(&cic_index_lock);
>>> -
>>>        /* Wait for cfqg->blkg->key accessors to exit their grace periods.
>>> */
>>>        call_rcu(&cfqd->rcu, cfq_cfqd_free);
>>>  }
>>> @@ -3823,7 +3894,7 @@ static int cfq_alloc_cic_index(void)
>>>        int index, error;
>>>
>>>        do {
>>> -               if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
>>> +               if (!ida_pre_get(&cic_index_ida, GFP_ATOMIC))
>>>                        return -ENOMEM;
>>>
>>>                spin_lock(&cic_index_lock);
>>> @@ -3839,20 +3910,18 @@ static int cfq_alloc_cic_index(void)
>>>  static void *cfq_init_queue(struct request_queue *q)
>>>  {
>>>        struct cfq_data *cfqd;
>>> -       int i, j;
>>> +       int i, j, idx;
>>>        struct cfq_group *cfqg;
>>>        struct cfq_rb_root *st;
>>>
>>> -       i = cfq_alloc_cic_index();
>>> -       if (i<  0)
>>> +       idx = cfq_alloc_cic_index();
>>> +       if (idx<  0)
>>>                return NULL;
>>>
>>>        cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO,
>>> q->node);
>>>        if (!cfqd)
>>>                return NULL;
>>>
>>> -       cfqd->cic_index = i;
>>> -
>>>        /* Init root service tree */
>>>        cfqd->grp_service_tree = CFQ_RB_ROOT;
>>>
>>> @@ -3861,6 +3930,9 @@ static void *cfq_init_queue(struct request_queue
>>> *q)
>>>        for_each_cfqg_st(cfqg, i, j, st)
>>>                *st = CFQ_RB_ROOT;
>>>        RB_CLEAR_NODE(&cfqg->rb_node);
>>> +       cfqg->cfqd = cfqd;
>>> +       cfqg->cic_index = idx;
>>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>>>
>>>        /* Give preference to root group over other groups */
>>>        cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
>>> @@ -3874,6 +3946,7 @@ static void *cfq_init_queue(struct request_queue
>>> *q)
>>>        rcu_read_lock();
>>>        cfq_blkiocg_add_blkio_group(&blkio_root_cgroup,&cfqg->blkg,
>>>                                        (void *)cfqd, 0);
>>> +       hlist_add_head(&cfqg->cfqd_node,&cfqd->cfqg_list);
>>>        rcu_read_unlock();
>>>  #endif
>>>        /*
>>> @@ -3893,8 +3966,6 @@ static void *cfq_init_queue(struct request_queue
>>> *q)
>>>        atomic_inc(&cfqd->oom_cfqq.ref);
>>>        cfq_link_cfqq_cfqg(&cfqd->oom_cfqq,&cfqd->root_group);
>>>
>>> -       INIT_LIST_HEAD(&cfqd->cic_list);
>>> -
>>>        cfqd->queue = q;
>>>
>>>        init_timer(&cfqd->idle_slice_timer);
>>> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
>>> index 64d5291..6c05b54 100644
>>> --- a/include/linux/iocontext.h
>>> +++ b/include/linux/iocontext.h
>>> @@ -18,7 +18,7 @@ struct cfq_io_context {
>>>        unsigned long ttime_samples;
>>>        unsigned long ttime_mean;
>>>
>>> -       struct list_head queue_list;
>>> +       struct list_head group_list;
>>>        struct hlist_node cic_list;
>>>
>>>        void (*dtor)(struct io_context *); /* destructor */
>>> --
>>> 1.6.2.5
>>>
>>
>
> --
> IKEDA, Munehiro
>  NEC Corporation of America
>    m-ikeda@ds.jp.nec.com
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group
  2010-08-13 23:01       ` Nauman Rafique
@ 2010-08-14  0:49         ` Munehiro Ikeda
  0 siblings, 0 replies; 53+ messages in thread
From: Munehiro Ikeda @ 2010-08-14  0:49 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: linux-kernel, Vivek Goyal, Ryo Tsuruta, taka, kamezawa.hiroyu,
	Andrea Righi, Gui Jianfeng, akpm, balbir

Nauman Rafique wrote, on 08/13/2010 07:01 PM:
> On Fri, Aug 13, 2010 at 2:00 PM, Munehiro Ikeda<m-ikeda@ds.jp.nec.com>  wrote:
>> Nauman Rafique wrote, on 08/12/2010 09:24 PM:
>>>
>>> Muuhh,
>>> I do not understand the motivation behind making cfq_io_context per
>>> cgroup. We can extract cgroup id from bio, use that to lookup cgroup
>>> and its async queues. What am I missing?
>>
>> The cgroup ID tells us only which cgroup the thread belongs to, not
>> the thread itself.  This means that IO priority and IO prio-class
>> can't be determined from the cgroup ID alone.
>
> One way to do it would be to get the ioprio and class from the context
> that is used to submit the async request.  IO priority and class are
> tied to a thread anyway, and the io context of that thread can be used
> to retrieve those values.  Even if a thread is submitting IOs to
> different cgroups, I don't see how you could apply a different IO
> priority and class to its async IOs for different cgroups.  Please let
> me know if this does not make sense.

My comment about IO prio might have been misleading and pointless, sorry.
I confused the IO context of the flush thread with that of the thread
which dirtied the page.

Now, the reason for making cfq_io_context per cgroup: it is simply
because the current code retrieves the cfqq from the cfq_io_context
via cic_to_cfqq().

As you said, we can look up the cgroup by its cgroup ID and then find
its async queue; this is what cfq_get_queue() and cfq_async_queue_prio()
do.  So if we change every call of cic_to_cfqq() into cfq_get_queue()
(or a slightly modified version of cfq_get_queue()), we may be able to
avoid making cfq_io_context per cgroup.

Which approach is better depends on the complexity of the resulting
patch.  I chose the former approach, but if the second one turns out
to be simpler, it is the better choice.  I need to think it over; my
feeling is that the second approach would be nicer.  Thanks for the
suggestion.
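
To make the second approach a bit more concrete, a rough and untested
sketch (my assumption only, not part of the posted patch) could look
like this, reusing cfq_get_cfqg() and cfq_async_queue_prio() with the
signatures in the patch above:

	/*
	 * Sketch: for async IO, look the queue up through the group taken
	 * from the bio instead of through a per-cgroup cfq_io_context.
	 * Assumes it sits in a cfq_get_queue()-like path where cfqd, bio,
	 * ioc, is_sync and cfqq are available.
	 */
	if (!is_sync) {
		struct cfq_group *cfqg = cfq_get_cfqg(cfqd, bio, 0);

		cfqq = NULL;
		if (cfqg && ioc)
			cfqq = *cfq_async_queue_prio(cfqg,
						     task_ioprio_class(ioc),
						     task_ioprio(ioc));
		/* fall back to allocating a new queue if cfqq is NULL */
	}

Callers that currently use cic_to_cfqq(cic, 0) for async IO would go
through something like this instead, so cfq_io_context itself would not
need to be per cgroup.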


>> The pointers to the async queues are held in cfqg->async_cfqq[class][prio].
>> Once IO priority is taken into account, it is impossible to tell which
>> queue is appropriate from the cgroup ID alone.
>>
>> Frankly speaking, I'm not 100% confident that IO priority should be
>> applied to async write IOs, but anyway, I made up my mind to keep the
>> current scheme.
>>
>> Do I make sense?  If you have any other idea, I am glad to hear it.
>>
>>
>> Thanks,
>> Muuhh
>>
>>
>>>
>>> On Thu, Jul 8, 2010 at 8:22 PM, Munehiro Ikeda<m-ikeda@ds.jp.nec.com>
>>>   wrote:
>>>>
>>>> This is the main body for async capability of blkio controller.
>>>> The basic ideas are
>>>>   - To move async queues from cfq_data to cfq_group, and
>>>>   - To associate cfq_io_context with cfq_group
>>>>
>>>> Each cfq_io_context, which was created per an io_context
>>>> per a device, is now created per an io_context per a cfq_group.
>>>> Each cfq_group is created per a cgroup per a device, so
>>>> cfq_io_context is now resulted in to be created per
>>>> an io_context per a cgroup per a device.
>>>>
>>>> To protect link between cfq_io_context and cfq_group,
>>>> locking code is modified in several parts.
>>>>
>>>> ToDo:
>>>> - Lock protection still might be imperfect and more investigation
>>>>   is needed.
>>>>
>>>> - cic_index of root cfq_group (cfqd->root_group.cic_index) is not
>>>>   removed and lasts beyond elevator switching.
>>>>   This issues should be solved.
>>>>
>>>> Signed-off-by: Munehiro "Muuhh" Ikeda<m-ikeda@ds.jp.nec.com>
>>>> ---
>>>>   block/cfq-iosched.c       |  277
>>>> ++++++++++++++++++++++++++++-----------------
>>>>   include/linux/iocontext.h |    2 +-
>>>>   2 files changed, 175 insertions(+), 104 deletions(-)
>>>>
>>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>>> index 68f76e9..4186c30 100644
>>>> --- a/block/cfq-iosched.c
>>>> +++ b/block/cfq-iosched.c
>>>> @@ -191,10 +191,23 @@ struct cfq_group {
>>>>         struct cfq_rb_root service_trees[2][3];
>>>>         struct cfq_rb_root service_tree_idle;
>>>>
>>>> +       /*
>>>> +        * async queue for each priority case
>>>> +        */
>>>> +       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>>>> +       struct cfq_queue *async_idle_cfqq;
>>>> +
>>>>         unsigned long saved_workload_slice;
>>>>         enum wl_type_t saved_workload;
>>>>         enum wl_prio_t saved_serving_prio;
>>>>         struct blkio_group blkg;
>>>> +       struct cfq_data *cfqd;
>>>> +
>>>> +       /* lock for cic_list */
>>>> +       spinlock_t lock;
>>>> +       unsigned int cic_index;
>>>> +       struct list_head cic_list;
>>>> +
>>>>   #ifdef CONFIG_CFQ_GROUP_IOSCHED
>>>>         struct hlist_node cfqd_node;
>>>>         atomic_t ref;
>>>> @@ -254,12 +267,6 @@ struct cfq_data {
>>>>         struct cfq_queue *active_queue;
>>>>         struct cfq_io_context *active_cic;
>>>>
>>>> -       /*
>>>> -        * async queue for each priority case
>>>> -        */
>>>> -       struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
>>>> -       struct cfq_queue *async_idle_cfqq;
>>>> -
>>>>         sector_t last_position;
>>>>
>>>>         /*
>>>> @@ -275,8 +282,6 @@ struct cfq_data {
>>>>         unsigned int cfq_latency;
>>>>         unsigned int cfq_group_isolation;
>>>>
>>>> -       unsigned int cic_index;
>>>> -       struct list_head cic_list;
>>>>
>>>>         /*
>>>>          * Fallback dummy cfqq for extreme OOM conditions
>>>> @@ -418,10 +423,16 @@ static inline int cfqg_busy_async_queues(struct
>>>> cfq_data *cfqd,
>>>>   }
>>>>
>>>>   static void cfq_dispatch_insert(struct request_queue *, struct request
>>>> *);
>>>> +static void __cfq_exit_single_io_context(struct cfq_data *,
>>>> +                                        struct cfq_group *,
>>>> +                                        struct cfq_io_context *);
>>>>   static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
>>>> -                                      struct io_context *, gfp_t);
>>>> +                                      struct cfq_io_context *, gfp_t);
>>>>   static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
>>>> -                                               struct io_context *);
>>>> +                                            struct cfq_group *,
>>>> +                                            struct io_context *);
>>>> +static void cfq_put_async_queues(struct cfq_group *);
>>>> +static int cfq_alloc_cic_index(void);
>>>>
>>>>   static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
>>>>                                             bool is_sync)
>>>> @@ -438,17 +449,28 @@ static inline void cic_set_cfqq(struct
>>>> cfq_io_context *cic,
>>>>   #define CIC_DEAD_KEY   1ul
>>>>   #define CIC_DEAD_INDEX_SHIFT   1
>>>>
>>>> -static inline void *cfqd_dead_key(struct cfq_data *cfqd)
>>>> +static inline void *cfqg_dead_key(struct cfq_group *cfqg)
>>>>   {
>>>> -       return (void *)(cfqd->cic_index<<    CIC_DEAD_INDEX_SHIFT |
>>>> CIC_DEAD_KEY);
>>>> +       return (void *)(cfqg->cic_index<<    CIC_DEAD_INDEX_SHIFT |
>>>> CIC_DEAD_KEY);
>>>> +}
>>>> +
>>>> +static inline struct cfq_group *cic_to_cfqg(struct cfq_io_context *cic)
>>>> +{
>>>> +       struct cfq_group *cfqg = cic->key;
>>>> +
>>>> +       if (unlikely((unsigned long) cfqg&    CIC_DEAD_KEY))
>>>> +               cfqg = NULL;
>>>> +
>>>> +       return cfqg;
>>>>   }
>>>>
>>>>   static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
>>>>   {
>>>> -       struct cfq_data *cfqd = cic->key;
>>>> +       struct cfq_data *cfqd = NULL;
>>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>>
>>>> -       if (unlikely((unsigned long) cfqd&    CIC_DEAD_KEY))
>>>> -               return NULL;
>>>> +       if (likely(cfqg))
>>>> +               cfqd =  cfqg->cfqd;
>>>>
>>>>         return cfqd;
>>>>   }
>>>> @@ -959,12 +981,12 @@ cfq_update_blkio_group_weight(struct blkio_group
>>>> *blkg, unsigned int weight)
>>>>   }
>>>>
>>>>   static struct cfq_group *
>>>> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int
>>>> create)
>>>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
>>>> +                   int create)
>>>>   {
>>>> -       struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>>         struct cfq_group *cfqg = NULL;
>>>>         void *key = cfqd;
>>>> -       int i, j;
>>>> +       int i, j, idx;
>>>>         struct cfq_rb_root *st;
>>>>         struct backing_dev_info *bdi =&cfqd->queue->backing_dev_info;
>>>>         unsigned int major, minor;
>>>> @@ -978,14 +1000,21 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct
>>>> cgroup *cgroup, int create)
>>>>         if (cfqg || !create)
>>>>                 goto done;
>>>>
>>>> +       idx = cfq_alloc_cic_index();
>>>> +       if (idx<    0)
>>>> +               goto done;
>>>> +
>>>>         cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>>>>         if (!cfqg)
>>>> -               goto done;
>>>> +               goto err;
>>>>
>>>>         for_each_cfqg_st(cfqg, i, j, st)
>>>>                 *st = CFQ_RB_ROOT;
>>>>         RB_CLEAR_NODE(&cfqg->rb_node);
>>>>
>>>> +       spin_lock_init(&cfqg->lock);
>>>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>>>> +
>>>>         /*
>>>>          * Take the initial reference that will be released on destroy
>>>>          * This can be thought of a joint reference by cgroup and
>>>> @@ -1002,7 +1031,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct
>>>> cgroup *cgroup, int create)
>>>>
>>>>         /* Add group on cfqd list */
>>>>         hlist_add_head(&cfqg->cfqd_node,&cfqd->cfqg_list);
>>>> +       cfqg->cfqd = cfqd;
>>>> +       cfqg->cic_index = idx;
>>>> +       goto done;
>>>>
>>>> +err:
>>>> +       spin_lock(&cic_index_lock);
>>>> +       ida_remove(&cic_index_ida, idx);
>>>> +       spin_unlock(&cic_index_lock);
>>>>   done:
>>>>         return cfqg;
>>>>   }
>>>> @@ -1095,10 +1131,6 @@ static inline struct cfq_group
>>>> *cfq_ref_get_cfqg(struct cfq_group *cfqg)
>>>>
>>>>   static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group
>>>> *cfqg)
>>>>   {
>>>> -       /* Currently, all async queues are mapped to root group */
>>>> -       if (!cfq_cfqq_sync(cfqq))
>>>> -               cfqg =&cfqq->cfqd->root_group;
>>>> -
>>>>         cfqq->cfqg = cfqg;
>>>>         /* cfqq reference on cfqg */
>>>>         atomic_inc(&cfqq->cfqg->ref);
>>>> @@ -1114,6 +1146,16 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>>>>                 return;
>>>>         for_each_cfqg_st(cfqg, i, j, st)
>>>>                 BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
>>>> +
>>>> +       /**
>>>> +        * ToDo:
>>>> +        * root_group never reaches here and cic_index is never
>>>> +        * removed.  Unused cic_index lasts by elevator switching.
>>>> +        */
>>>> +       spin_lock(&cic_index_lock);
>>>> +       ida_remove(&cic_index_ida, cfqg->cic_index);
>>>> +       spin_unlock(&cic_index_lock);
>>>> +
>>>>         kfree(cfqg);
>>>>   }
>>>>
>>>> @@ -1122,6 +1164,15 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>>         /* Something wrong if we are trying to remove same group twice */
>>>>         BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
>>>>
>>>> +       while (!list_empty(&cfqg->cic_list)) {
>>>> +               struct cfq_io_context *cic = list_entry(cfqg->cic_list.next,
>>>> +                                                       struct cfq_io_context,
>>>> +                                                       group_list);
>>>> +
>>>> +               __cfq_exit_single_io_context(cfqd, cfqg, cic);
>>>> +       }
>>>> +
>>>> +       cfq_put_async_queues(cfqg);
>>>>         hlist_del_init(&cfqg->cfqd_node);
>>>>
>>>>         /*
>>>> @@ -1497,10 +1548,12 @@ static struct request *
>>>>   cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
>>>>   {
>>>>         struct task_struct *tsk = current;
>>>> +       struct cfq_group *cfqg;
>>>>         struct cfq_io_context *cic;
>>>>         struct cfq_queue *cfqq;
>>>>
>>>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>>>         if (!cic)
>>>>                 return NULL;
>>>>
>>>> @@ -1611,6 +1664,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>>>>                            struct bio *bio)
>>>>   {
>>>>         struct cfq_data *cfqd = q->elevator->elevator_data;
>>>> +       struct cfq_group *cfqg;
>>>>         struct cfq_io_context *cic;
>>>>         struct cfq_queue *cfqq;
>>>>
>>>> @@ -1624,7 +1678,8 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>>>>          * Lookup the cfqq that this bio will be queued with. Allow
>>>>          * merge only if rq is queued there.
>>>>          */
>>>> -       cic = cfq_cic_lookup(cfqd, current->io_context);
>>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>>> +       cic = cfq_cic_lookup(cfqd, cfqg, current->io_context);
>>>>         if (!cic)
>>>>                 return false;
>>>>
>>>> @@ -2667,17 +2722,18 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>>>   }
>>>>
>>>>   static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>>>> +                                        struct cfq_group *cfqg,
>>>>                                          struct cfq_io_context *cic)
>>>>   {
>>>>         struct io_context *ioc = cic->ioc;
>>>>
>>>> -       list_del_init(&cic->queue_list);
>>>> +       list_del_init(&cic->group_list);
>>>>
>>>>         /*
>>>>          * Make sure dead mark is seen for dead queues
>>>>          */
>>>>         smp_wmb();
>>>> -       cic->key = cfqd_dead_key(cfqd);
>>>> +       cic->key = cfqg_dead_key(cfqg);
>>>>
>>>>         if (ioc->ioc_data == cic)
>>>>                 rcu_assign_pointer(ioc->ioc_data, NULL);
>>>> @@ -2696,23 +2752,23 @@ static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
>>>>   static void cfq_exit_single_io_context(struct io_context *ioc,
>>>>                                        struct cfq_io_context *cic)
>>>>   {
>>>> -       struct cfq_data *cfqd = cic_to_cfqd(cic);
>>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>>
>>>> -       if (cfqd) {
>>>> -               struct request_queue *q = cfqd->queue;
>>>> +       if (cfqg) {
>>>> +               struct cfq_data *cfqd = cfqg->cfqd;
>>>>                 unsigned long flags;
>>>>
>>>> -               spin_lock_irqsave(q->queue_lock, flags);
>>>> +               spin_lock_irqsave(&cfqg->lock, flags);
>>>>
>>>>                 /*
>>>>                  * Ensure we get a fresh copy of the ->key to prevent
>>>>                  * race between exiting task and queue
>>>>                  */
>>>>                 smp_read_barrier_depends();
>>>> -               if (cic->key == cfqd)
>>>> -                       __cfq_exit_single_io_context(cfqd, cic);
>>>> +               if (cic->key == cfqg)
>>>> +                       __cfq_exit_single_io_context(cfqd, cfqg, cic);
>>>>
>>>> -               spin_unlock_irqrestore(q->queue_lock, flags);
>>>> +               spin_unlock_irqrestore(&cfqg->lock, flags);
>>>>         }
>>>>   }
>>>>
>>>> @@ -2734,7 +2790,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>>>                                                         cfqd->queue->node);
>>>>         if (cic) {
>>>>                 cic->last_end_request = jiffies;
>>>> -               INIT_LIST_HEAD(&cic->queue_list);
>>>> +               INIT_LIST_HEAD(&cic->group_list);
>>>>                 INIT_HLIST_NODE(&cic->cic_list);
>>>>                 cic->dtor = cfq_free_io_context;
>>>>                 cic->exit = cfq_exit_io_context;
>>>> @@ -2801,8 +2857,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
>>>>         cfqq = cic->cfqq[BLK_RW_ASYNC];
>>>>         if (cfqq) {
>>>>                 struct cfq_queue *new_cfqq;
>>>> -               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
>>>> -                                               GFP_ATOMIC);
>>>> +               new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic, GFP_ATOMIC);
>>>>                 if (new_cfqq) {
>>>>                         cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
>>>>                         cfq_put_queue(cfqq);
>>>> @@ -2879,16 +2934,14 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
>>>>
>>>>   static struct cfq_queue *
>>>>   cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
>>>> -                    struct io_context *ioc, gfp_t gfp_mask)
>>>> +                    struct cfq_io_context *cic, gfp_t gfp_mask)
>>>>   {
>>>>         struct cfq_queue *cfqq, *new_cfqq = NULL;
>>>> -       struct cfq_io_context *cic;
>>>> +       struct io_context *ioc = cic->ioc;
>>>>         struct cfq_group *cfqg;
>>>>
>>>>   retry:
>>>> -       cfqg = cfq_get_cfqg(cfqd, NULL, 1);
>>>> -       cic = cfq_cic_lookup(cfqd, ioc);
>>>> -       /* cic always exists here */
>>>> +       cfqg = cic_to_cfqg(cic);
>>>>         cfqq = cic_to_cfqq(cic, is_sync);
>>>>
>>>>         /*
>>>> @@ -2930,36 +2983,38 @@ retry:
>>>>   }
>>>>
>>>>   static struct cfq_queue **
>>>> -cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
>>>> +cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
>>>>   {
>>>>         switch (ioprio_class) {
>>>>         case IOPRIO_CLASS_RT:
>>>> -               return &cfqd->async_cfqq[0][ioprio];
>>>> +               return &cfqg->async_cfqq[0][ioprio];
>>>>         case IOPRIO_CLASS_BE:
>>>> -               return &cfqd->async_cfqq[1][ioprio];
>>>> +               return &cfqg->async_cfqq[1][ioprio];
>>>>         case IOPRIO_CLASS_IDLE:
>>>> -               return &cfqd->async_idle_cfqq;
>>>> +               return &cfqg->async_idle_cfqq;
>>>>         default:
>>>>                 BUG();
>>>>         }
>>>>   }
>>>>
>>>>   static struct cfq_queue *
>>>> -cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
>>>> +cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_context *cic,
>>>>               gfp_t gfp_mask)
>>>>   {
>>>> +       struct io_context *ioc = cic->ioc;
>>>>         const int ioprio = task_ioprio(ioc);
>>>>         const int ioprio_class = task_ioprio_class(ioc);
>>>> +       struct cfq_group *cfqg = cic_to_cfqg(cic);
>>>>         struct cfq_queue **async_cfqq = NULL;
>>>>         struct cfq_queue *cfqq = NULL;
>>>>
>>>>         if (!is_sync) {
>>>> -               async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
>>>> +               async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
>>>>                 cfqq = *async_cfqq;
>>>>         }
>>>>
>>>>         if (!cfqq)
>>>> -               cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
>>>> +               cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, gfp_mask);
>>>>
>>>>         /*
>>>>          * pin the queue now that it's allocated, scheduler exit will prune it
>>>> @@ -2977,19 +3032,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
>>>>   * We drop cfq io contexts lazily, so we may find a dead one.
>>>>   */
>>>>   static void
>>>> -cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>>>> -                 struct cfq_io_context *cic)
>>>> +cfq_drop_dead_cic(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> +                 struct io_context *ioc, struct cfq_io_context *cic)
>>>>   {
>>>>         unsigned long flags;
>>>>
>>>> -       WARN_ON(!list_empty(&cic->queue_list));
>>>> -       BUG_ON(cic->key != cfqd_dead_key(cfqd));
>>>> +       WARN_ON(!list_empty(&cic->group_list));
>>>> +       BUG_ON(cic->key != cfqg_dead_key(cfqg));
>>>>
>>>>         spin_lock_irqsave(&ioc->lock, flags);
>>>>
>>>>         BUG_ON(ioc->ioc_data == cic);
>>>>
>>>> -       radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
>>>> +       radix_tree_delete(&ioc->radix_root, cfqg->cic_index);
>>>>         hlist_del_rcu(&cic->cic_list);
>>>>         spin_unlock_irqrestore(&ioc->lock, flags);
>>>>
>>>> @@ -2997,11 +3052,14 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
>>>>   }
>>>>
>>>>   static struct cfq_io_context *
>>>> -cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>>> +cfq_cic_lookup(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> +              struct io_context *ioc)
>>>>   {
>>>>         struct cfq_io_context *cic;
>>>>         unsigned long flags;
>>>>
>>>> +       if (!cfqg)
>>>> +               return NULL;
>>>>         if (unlikely(!ioc))
>>>>                 return NULL;
>>>>
>>>> @@ -3011,18 +3069,18 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>>>          * we maintain a last-hit cache, to avoid browsing over the tree
>>>>          */
>>>>         cic = rcu_dereference(ioc->ioc_data);
>>>> -       if (cic && cic->key == cfqd) {
>>>> +       if (cic && cic->key == cfqg) {
>>>>                 rcu_read_unlock();
>>>>                 return cic;
>>>>         }
>>>>
>>>>         do {
>>>> -               cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
>>>> +               cic = radix_tree_lookup(&ioc->radix_root, cfqg->cic_index);
>>>>                 rcu_read_unlock();
>>>>                 if (!cic)
>>>>                         break;
>>>> -               if (unlikely(cic->key != cfqd)) {
>>>> -                       cfq_drop_dead_cic(cfqd, ioc, cic);
>>>> +               if (unlikely(cic->key != cfqg)) {
>>>> +                       cfq_drop_dead_cic(cfqd, cfqg, ioc, cic);
>>>>                         rcu_read_lock();
>>>>                         continue;
>>>>                 }
>>>> @@ -3037,24 +3095,25 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
>>>>   }
>>>>
>>>>   /*
>>>> - * Add cic into ioc, using cfqd as the search key. This enables us to lookup
>>>> + * Add cic into ioc, using cfqg as the search key. This enables us to lookup
>>>>   * the process specific cfq io context when entered from the block layer.
>>>> - * Also adds the cic to a per-cfqd list, used when this queue is removed.
>>>> + * Also adds the cic to a per-cfqg list, used when the group is removed.
>>>> + * request_queue lock must be held.
>>>>   */
>>>> -static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>>> -                       struct cfq_io_context *cic, gfp_t gfp_mask)
>>>> +static int cfq_cic_link(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> +                       struct io_context *ioc, struct cfq_io_context *cic)
>>>>   {
>>>>         unsigned long flags;
>>>>         int ret;
>>>>
>>>> -       ret = radix_tree_preload(gfp_mask);
>>>> +       ret = radix_tree_preload(GFP_ATOMIC);
>>>>         if (!ret) {
>>>>                 cic->ioc = ioc;
>>>> -               cic->key = cfqd;
>>>> +               cic->key = cfqg;
>>>>
>>>>                 spin_lock_irqsave(&ioc->lock, flags);
>>>>                 ret = radix_tree_insert(&ioc->radix_root,
>>>> -                                               cfqd->cic_index, cic);
>>>> +                                               cfqg->cic_index, cic);
>>>>                 if (!ret)
>>>>                         hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
>>>>                 spin_unlock_irqrestore(&ioc->lock, flags);
>>>> @@ -3062,9 +3121,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>>>                 radix_tree_preload_end();
>>>>
>>>>                 if (!ret) {
>>>> -                       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>>>> -                       list_add(&cic->queue_list, &cfqd->cic_list);
>>>> -                       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>> +                       spin_lock_irqsave(&cfqg->lock, flags);
>>>> +                       list_add(&cic->group_list, &cfqg->cic_list);
>>>> +                       spin_unlock_irqrestore(&cfqg->lock, flags);
>>>>                 }
>>>>         }
>>>>
>>>> @@ -3080,10 +3139,12 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
>>>>   * than one device managed by cfq.
>>>>   */
>>>>   static struct cfq_io_context *
>>>> -cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>>> +cfq_get_io_context(struct cfq_data *cfqd, struct bio *bio, gfp_t gfp_mask)
>>>>   {
>>>>         struct io_context *ioc = NULL;
>>>>         struct cfq_io_context *cic;
>>>> +       struct cfq_group *cfqg, *cfqg2;
>>>> +       unsigned long flags;
>>>>
>>>>         might_sleep_if(gfp_mask & __GFP_WAIT);
>>>>
>>>> @@ -3091,18 +3152,38 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
>>>>         if (!ioc)
>>>>                 return NULL;
>>>>
>>>> -       cic = cfq_cic_lookup(cfqd, ioc);
>>>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>>>> +retry_cfqg:
>>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 1);
>>>> +retry_cic:
>>>> +       cic = cfq_cic_lookup(cfqd, cfqg, ioc);
>>>>         if (cic)
>>>>                 goto out;
>>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>>
>>>>         cic = cfq_alloc_io_context(cfqd, gfp_mask);
>>>>         if (cic == NULL)
>>>>                 goto err;
>>>>
>>>> -       if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
>>>> +       spin_lock_irqsave(cfqd->queue->queue_lock, flags);
>>>> +
>>>> +       /* check the consistency breakage during unlock period */
>>>> +       cfqg2 = cfq_get_cfqg(cfqd, bio, 0);
>>>> +       if (cfqg != cfqg2) {
>>>> +               cfq_cic_free(cic);
>>>> +               if (!cfqg2)
>>>> +                       goto retry_cfqg;
>>>> +               else {
>>>> +                       cfqg = cfqg2;
>>>> +                       goto retry_cic;
>>>> +               }
>>>> +       }
>>>> +
>>>> +       if (cfq_cic_link(cfqd, cfqg, ioc, cic))
>>>>                 goto err_free;
>>>>
>>>>   out:
>>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>>         smp_read_barrier_depends();
>>>>         if (unlikely(ioc->ioprio_changed))
>>>>                 cfq_ioc_set_ioprio(ioc);
>>>> @@ -3113,6 +3194,7 @@ out:
>>>>   #endif
>>>>         return cic;
>>>>   err_free:
>>>> +       spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
>>>>         cfq_cic_free(cic);
>>>>   err:
>>>>         put_io_context(ioc);
>>>> @@ -3537,6 +3619,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
>>>>   static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>>>>   {
>>>>         struct cfq_data *cfqd = q->elevator->elevator_data;
>>>> +       struct cfq_group *cfqg;
>>>>         struct task_struct *tsk = current;
>>>>         struct cfq_io_context *cic;
>>>>         struct cfq_queue *cfqq;
>>>> @@ -3547,7 +3630,8 @@ static int cfq_may_queue(struct request_queue *q, struct bio *bio, int rw)
>>>>          * so just lookup a possibly existing queue, or return 'may queue'
>>>>          * if that fails
>>>>          */
>>>> -       cic = cfq_cic_lookup(cfqd, tsk->io_context);
>>>> +       cfqg = cfq_get_cfqg(cfqd, bio, 0);
>>>> +       cic = cfq_cic_lookup(cfqd, cfqg, tsk->io_context);
>>>>         if (!cic)
>>>>                 return ELV_MQUEUE_MAY;
>>>>
>>>> @@ -3636,7 +3720,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>>>>
>>>>         might_sleep_if(gfp_mask & __GFP_WAIT);
>>>>
>>>> -       cic = cfq_get_io_context(cfqd, gfp_mask);
>>>> +       cic = cfq_get_io_context(cfqd, bio, gfp_mask);
>>>>
>>>>         spin_lock_irqsave(q->queue_lock, flags);
>>>>
>>>> @@ -3646,7 +3730,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>>>>   new_queue:
>>>>         cfqq = cic_to_cfqq(cic, is_sync);
>>>>         if (!cfqq || cfqq == &cfqd->oom_cfqq) {
>>>> -               cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
>>>> +               cfqq = cfq_get_queue(cfqd, is_sync, cic, gfp_mask);
>>>>                 cic_set_cfqq(cic, cfqq, is_sync);
>>>>         } else {
>>>>                 /*
>>>> @@ -3762,19 +3846,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
>>>>         cancel_work_sync(&cfqd->unplug_work);
>>>>   }
>>>>
>>>> -static void cfq_put_async_queues(struct cfq_data *cfqd)
>>>> +static void cfq_put_async_queues(struct cfq_group *cfqg)
>>>>   {
>>>>         int i;
>>>>
>>>>         for (i = 0; i < IOPRIO_BE_NR; i++) {
>>>> -               if (cfqd->async_cfqq[0][i])
>>>> -                       cfq_put_queue(cfqd->async_cfqq[0][i]);
>>>> -               if (cfqd->async_cfqq[1][i])
>>>> -                       cfq_put_queue(cfqd->async_cfqq[1][i]);
>>>> +               if (cfqg->async_cfqq[0][i])
>>>> +                       cfq_put_queue(cfqg->async_cfqq[0][i]);
>>>> +               if (cfqg->async_cfqq[1][i])
>>>> +                       cfq_put_queue(cfqg->async_cfqq[1][i]);
>>>>         }
>>>>
>>>> -       if (cfqd->async_idle_cfqq)
>>>> -               cfq_put_queue(cfqd->async_idle_cfqq);
>>>> +       if (cfqg->async_idle_cfqq)
>>>> +               cfq_put_queue(cfqg->async_idle_cfqq);
>>>>   }
>>>>
>>>>   static void cfq_cfqd_free(struct rcu_head *head)
>>>> @@ -3794,15 +3878,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>>>>         if (cfqd->active_queue)
>>>>                 __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
>>>>
>>>> -       while (!list_empty(&cfqd->cic_list)) {
>>>> -               struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
>>>> -                                                       struct cfq_io_context,
>>>> -                                                       queue_list);
>>>> -
>>>> -               __cfq_exit_single_io_context(cfqd, cic);
>>>> -       }
>>>> -
>>>> -       cfq_put_async_queues(cfqd);
>>>>         cfq_release_cfq_groups(cfqd);
>>>>         cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
>>>>
>>>> @@ -3810,10 +3885,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
>>>>
>>>>         cfq_shutdown_timer_wq(cfqd);
>>>>
>>>> -       spin_lock(&cic_index_lock);
>>>> -       ida_remove(&cic_index_ida, cfqd->cic_index);
>>>> -       spin_unlock(&cic_index_lock);
>>>> -
>>>>         /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
>>>>         call_rcu(&cfqd->rcu, cfq_cfqd_free);
>>>>   }
>>>> @@ -3823,7 +3894,7 @@ static int cfq_alloc_cic_index(void)
>>>>         int index, error;
>>>>
>>>>         do {
>>>> -               if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
>>>> +               if (!ida_pre_get(&cic_index_ida, GFP_ATOMIC))
>>>>                         return -ENOMEM;
>>>>
>>>>                 spin_lock(&cic_index_lock);
>>>> @@ -3839,20 +3910,18 @@ static int cfq_alloc_cic_index(void)
>>>>   static void *cfq_init_queue(struct request_queue *q)
>>>>   {
>>>>         struct cfq_data *cfqd;
>>>> -       int i, j;
>>>> +       int i, j, idx;
>>>>         struct cfq_group *cfqg;
>>>>         struct cfq_rb_root *st;
>>>>
>>>> -       i = cfq_alloc_cic_index();
>>>> -       if (i < 0)
>>>> +       idx = cfq_alloc_cic_index();
>>>> +       if (idx < 0)
>>>>                 return NULL;
>>>>
>>>>         cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
>>>>         if (!cfqd)
>>>>                 return NULL;
>>>>
>>>> -       cfqd->cic_index = i;
>>>> -
>>>>         /* Init root service tree */
>>>>         cfqd->grp_service_tree = CFQ_RB_ROOT;
>>>>
>>>> @@ -3861,6 +3930,9 @@ static void *cfq_init_queue(struct request_queue *q)
>>>>         for_each_cfqg_st(cfqg, i, j, st)
>>>>                 *st = CFQ_RB_ROOT;
>>>>         RB_CLEAR_NODE(&cfqg->rb_node);
>>>> +       cfqg->cfqd = cfqd;
>>>> +       cfqg->cic_index = idx;
>>>> +       INIT_LIST_HEAD(&cfqg->cic_list);
>>>>
>>>>         /* Give preference to root group over other groups */
>>>>         cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
>>>> @@ -3874,6 +3946,7 @@ static void *cfq_init_queue(struct request_queue *q)
>>>>         rcu_read_lock();
>>>>         cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
>>>>                                         (void *)cfqd, 0);
>>>> +       hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>>>>         rcu_read_unlock();
>>>>   #endif
>>>>         /*
>>>> @@ -3893,8 +3966,6 @@ static void *cfq_init_queue(struct request_queue *q)
>>>>         atomic_inc(&cfqd->oom_cfqq.ref);
>>>> -       cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
>>>>
>>>> -       INIT_LIST_HEAD(&cfqd->cic_list);
>>>> -
>>>>         cfqd->queue = q;
>>>>
>>>>         init_timer(&cfqd->idle_slice_timer);
>>>> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
>>>> index 64d5291..6c05b54 100644
>>>> --- a/include/linux/iocontext.h
>>>> +++ b/include/linux/iocontext.h
>>>> @@ -18,7 +18,7 @@ struct cfq_io_context {
>>>>         unsigned long ttime_samples;
>>>>         unsigned long ttime_mean;
>>>>
>>>> -       struct list_head queue_list;
>>>> +       struct list_head group_list;
>>>>         struct hlist_node cic_list;
>>>>
>>>>         void (*dtor)(struct io_context *); /* destructor */
>>>> --
>>>> 1.6.2.5

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


Thread overview: 53+ messages
2010-07-09  2:57 [RFC][PATCH 00/11] blkiocg async support Munehiro Ikeda
2010-07-09  3:14 ` [RFC][PATCH 01/11] blkiocg async: Make page_cgroup independent from memory controller Munehiro Ikeda
2010-07-26  6:49   ` Balbir Singh
2010-07-09  3:15 ` [RFC][PATCH 02/11] blkiocg async: The main part of iotrack Munehiro Ikeda
2010-07-09  7:35   ` KAMEZAWA Hiroyuki
2010-07-09 23:06     ` Munehiro Ikeda
2010-07-12  0:11       ` KAMEZAWA Hiroyuki
2010-07-14 14:46         ` Munehiro IKEDA
2010-07-09  7:38   ` KAMEZAWA Hiroyuki
2010-07-09 23:09     ` Munehiro Ikeda
2010-07-10 10:06       ` Andrea Righi
2010-07-09  3:16 ` [RFC][PATCH 03/11] blkiocg async: Hooks for iotrack Munehiro Ikeda
2010-07-09  9:24   ` Andrea Righi
2010-07-09 23:43     ` Munehiro Ikeda
2010-07-09  3:16 ` [RFC][PATCH 04/11] blkiocg async: block_commit_write not to record process info Munehiro Ikeda
2010-07-09  3:17 ` [RFC][PATCH 05/11] blkiocg async: __set_page_dirty_nobuffer " Munehiro Ikeda
2010-07-09  3:17 ` [RFC][PATCH 06/11] blkiocg async: ext4_writepage not to overwrite iotrack info Munehiro Ikeda
2010-07-09  3:18 ` [RFC][PATCH 07/11] blkiocg async: Pass bio to elevator_ops functions Munehiro Ikeda
2010-07-09  3:19 ` [RFC][PATCH 08/11] blkiocg async: Function to search blkcg from css ID Munehiro Ikeda
2010-07-09  3:20 ` [RFC][PATCH 09/11] blkiocg async: Functions to get cfqg from bio Munehiro Ikeda
2010-07-09  3:22 ` [RFC][PATCH 10/11] blkiocg async: Async queue per cfq_group Munehiro Ikeda
2010-08-13  1:24   ` Nauman Rafique
2010-08-13 21:00     ` Munehiro Ikeda
2010-08-13 23:01       ` Nauman Rafique
2010-08-14  0:49         ` Munehiro Ikeda
2010-07-09  3:23 ` [RFC][PATCH 11/11] blkiocg async: Workload timeslice adjustment for async queues Munehiro Ikeda
2010-07-09 10:04 ` [RFC][PATCH 00/11] blkiocg async support Andrea Righi
2010-07-09 13:45 ` Vivek Goyal
2010-07-10  0:17   ` Munehiro Ikeda
2010-07-10  0:55     ` Nauman Rafique
2010-07-10 13:24       ` Vivek Goyal
2010-07-12  0:20         ` KAMEZAWA Hiroyuki
2010-07-12 13:18           ` Vivek Goyal
2010-07-13  4:36             ` KAMEZAWA Hiroyuki
2010-07-14 14:29               ` Vivek Goyal
2010-07-15  0:00                 ` KAMEZAWA Hiroyuki
2010-07-16 13:43                   ` Vivek Goyal
2010-07-16 14:15                     ` Daniel P. Berrange
2010-07-16 14:35                       ` Vivek Goyal
2010-07-16 14:53                         ` Daniel P. Berrange
2010-07-16 15:12                           ` Vivek Goyal
2010-07-27 10:40                             ` Daniel P. Berrange
2010-07-27 14:03                               ` Vivek Goyal
2010-07-22 19:28           ` Greg Thelen
2010-07-22 23:59             ` KAMEZAWA Hiroyuki
2010-07-26  6:41 ` Balbir Singh
2010-07-27  6:40   ` Greg Thelen
2010-07-27  6:39     ` KAMEZAWA Hiroyuki
2010-08-02 20:58 ` Vivek Goyal
2010-08-03 14:31   ` Munehiro Ikeda
2010-08-03 19:24     ` Nauman Rafique
2010-08-04 14:32       ` Munehiro Ikeda
2010-08-03 20:15     ` Vivek Goyal
