[RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
@ 2011-03-08 21:20 Justin TerAvest
  2011-03-08 21:20 ` [PATCH 1/6] Add IO cgroup tracking " Justin TerAvest
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin

This patchset adds tracking to the page_cgroup structure for which cgroup has
dirtied a page, and uses that information to provide isolation between
cgroups performing writeback.

I know that there is some discussion to remove request descriptor limits
entirely, but I included a patch to introduce per-cgroup limits to enable
this functionality. Without it, we didn't see much isolation improvement.

I think most of this material has been discussed on lkml previously, this is
just another attempt to make a patchset that handles buffered writes for CFQ.

There was a lot of previous discussion at:
  http://thread.gmane.org/gmane.linux.kernel/1007922

Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
and Vivek Goyal for work on previous versions of these patches.

 Documentation/block/biodoc.txt |   10 +
 block/blk-cgroup.c             |  204 +++++++++++++++++++++-
 block/blk-cgroup.h             |    9 +-
 block/blk-core.c               |  216 +++++++++++++++--------
 block/blk-settings.c           |    2 +-
 block/blk-sysfs.c              |   60 ++++---
 block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
 block/cfq.h                    |    6 +-
 block/elevator.c               |   11 +-
 fs/buffer.c                    |    2 +
 fs/direct-io.c                 |    2 +
 include/linux/blk_types.h      |    2 +
 include/linux/blkdev.h         |   81 ++++++++-
 include/linux/blkio-track.h    |   89 +++++++++
 include/linux/elevator.h       |   14 ++-
 include/linux/iocontext.h      |    1 +
 include/linux/memcontrol.h     |    6 +
 include/linux/mmzone.h         |    4 +-
 include/linux/page_cgroup.h    |   12 +-
 init/Kconfig                   |   16 ++
 mm/Makefile                    |    3 +-
 mm/bounce.c                    |    2 +
 mm/filemap.c                   |    2 +
 mm/memcontrol.c                |    6 +
 mm/memory.c                    |    6 +
 mm/page-writeback.c            |   14 ++-
 mm/page_cgroup.c               |   29 ++-
 mm/swap_state.c                |    2 +
 28 files changed, 985 insertions(+), 216 deletions(-)

[PATCH 1/6] Add IO cgroup tracking for buffered writes.
[PATCH 2/6] Make async queues per cgroup.
[PATCH 3/6] Modify CFQ to use IO tracking information.
[PATCH 4/6] With per-cgroup async, don't special case queues.
[PATCH 5/6] Add stat for per cgroup writeout done by flusher.
[PATCH 6/6] Per cgroup request descriptor counts

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/6] Add IO cgroup tracking for buffered writes.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 21:20 ` [PATCH 2/6] Make async queues per cgroup Justin TerAvest
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

This patch adds IO tracking code in the mm/ tree so that the block layer
can provide isolation for buffered writes. I've left modifying the
page_cgroup structure as simple as possible; I'm happy to change this to
using bits as part of the "flags" structure to reduce overhead.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 block/blk-cgroup.c          |  184 +++++++++++++++++++++++++++++++++++++++++++
 fs/buffer.c                 |    2 +
 fs/direct-io.c              |    2 +
 include/linux/blkio-track.h |   89 +++++++++++++++++++++
 include/linux/iocontext.h   |    1 +
 include/linux/memcontrol.h  |    6 ++
 include/linux/mmzone.h      |    4 +-
 include/linux/page_cgroup.h |   12 +++-
 init/Kconfig                |   16 ++++
 mm/Makefile                 |    3 +-
 mm/bounce.c                 |    2 +
 mm/filemap.c                |    2 +
 mm/memcontrol.c             |    6 ++
 mm/memory.c                 |    6 ++
 mm/page-writeback.c         |   14 +++-
 mm/page_cgroup.c            |   29 +++++---
 mm/swap_state.c             |    2 +
 17 files changed, 363 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/blkio-track.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..80d88ec 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,6 +19,8 @@
 #include <linux/slab.h>
 #include "blk-cgroup.h"
 #include <linux/genhd.h>
+#include <linux/blkio-track.h>
+#include <linux/mm_inline.h>
 
 #define MAX_KEY_LEN 100
 
@@ -175,6 +177,12 @@ static inline void blkio_update_group_iops(struct blkio_group *blkg,
 	}
 }
 
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_subsys_id),
+					struct blkio_cgroup, css);
+}
+
 /*
  * Add to the appropriate stat variable depending on the request type.
  * This should be called with the blkg->stats_lock held.
@@ -1233,8 +1241,20 @@ blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	return 0;
 }
 
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
+
+	return (u64)css_id(&blkcg->css);
+}
+
 struct cftype blkio_files[] = {
 	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+	{
 		.name = "weight_device",
 		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
 				BLKIO_PROP_weight_device),
@@ -1385,6 +1405,170 @@ struct cftype blkio_files[] = {
 #endif
 };
 
+/* Block IO tracking related functions */
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *blkcg;
+	struct page_cgroup *pc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	pc->blkio_cgroup_id = 0;	/* 0: default blkio_cgroup id */
+	if (!mm)
+		return;
+	/*
+	 * Locking "pc" isn't necessary here since the current process is
+	 * the only one that can access the members related to blkio_cgroup.
+	 */
+	rcu_read_lock();
+	blkcg = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!blkcg))
+		goto out;
+	/*
+	 * css_get(&bio->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "blkcg" so pc->blkio_cgroup_id
+	 * might turn invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	pc->blkio_cgroup_id = css_id(&blkcg->css);
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	/*
+	 * A little trick:
+	 * Just call blkio_cgroup_set_owner() for pages which are already
+	 * active since the blkio_cgroup_id member of page_cgroup can be
+	 * updated without any locks. This is because an integer type of
+	 * variable can be set a new value at once on modern cpus.
+	 */
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	/*
+	 * Do this without any locks. The reason is the same as
+	 * blkio_cgroup_reset_owner().
+	 */
+	npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc)
+		id = pc->blkio_cgroup_id;
+	return id;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page:	the page to be tracked
+ *
+ * Returns the cgroup of a given page. A return value zero means that
+ * the page associated with the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct cgroup_subsys_state *css;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		return NULL;
+
+	css = css_lookup(&blkio_subsys, pc->blkio_cgroup_id);
+	if (!css)
+		return NULL;
+
+	return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
 static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 {
 	return cgroup_add_files(cgroup, subsys, blkio_files,
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..1e911dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/blkio-track.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -667,6 +668,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b044705..2e8d5aa 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/blkio-track.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -852,6 +853,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/blkio-track.h b/include/linux/blkio-track.h
new file mode 100644
index 0000000..aedf780
--- /dev/null
+++ b/include/linux/blkio-track.h
@@ -0,0 +1,89 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+struct block_device;
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	pc->blkio_cgroup_id = 0;
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIOTRACK */
+
+struct blkiotrack_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page,
+						struct page *opage)
+{
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b2eee89..3e70b21 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -76,6 +76,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(struct task_struct *task)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f512e18..a8a7cf0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -49,6 +49,8 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -153,6 +155,10 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..a04c37a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -615,7 +615,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -975,7 +975,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 6d6cb7a..c3e66fd 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -11,10 +11,15 @@
  * then the page cgroup for pfn always exists.
  */
 struct page_cgroup {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
 	struct page *page;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	unsigned long blkio_cgroup_id;
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -33,6 +38,8 @@ static inline void __init page_cgroup_init(void)
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+
 enum {
 	/* flags for mem_cgroup */
 	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
@@ -131,8 +138,9 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
 	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
 	local_irq_restore(*flags);
 }
+#endif
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index be788c0..256041f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -742,6 +742,22 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_BLKIOTRACK
+	bool
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	default n
+	---help---
+	  Provides a Resource Controller which enables to track the onwner
+	  of every Block I/O requests.
+	  The information this subsystem provides can be used from any
+	  kind of module such as dm-ioband device mapper modules or
+	  the cfq-scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIOTRACK
+
 endif # CGROUPS
 
 menuconfig NAMESPACES
diff --git a/mm/Makefile b/mm/Makefile
index 2b1b575..7da3bc8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,7 +38,8 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..64980fb 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/blkio-track.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -211,6 +212,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..ab9b53a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -407,6 +408,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da53a25..e11c2cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,6 +359,12 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(void);
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 31250fa..4735c3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -52,6 +52,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2403,6 +2404,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2828,6 +2830,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	do_page_add_anon_rmap(page, vma, address, exclusive);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2959,6 +2962,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
+	/* Not setting the owner for special pages */
+	blkio_cgroup_set_owner(page, mm);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
@@ -3114,6 +3119,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter_fast(mm, MM_ANONPAGES);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..b2a8f81 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/blkio-track.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1153,7 +1154,8 @@ EXPORT_SYMBOL(account_page_writeback);
  * We take care to handle the case where the page was truncated from the
  * mapping by re-checking page_mapping() inside tree_lock.
  */
-int __set_page_dirty_nobuffers(struct page *page)
+int __set_page_dirty_nobuffers_track_owner(struct page *page,
+					   int update_owner)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
@@ -1168,6 +1170,9 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			if (update_owner)
+				blkio_cgroup_reset_owner_pagedirty(page,
+								current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
@@ -1180,6 +1185,11 @@ int __set_page_dirty_nobuffers(struct page *page)
 	}
 	return 0;
 }
+
+int __set_page_dirty_nobuffers(struct page *page)
+{
+	return __set_page_dirty_nobuffers_track_owner(page, 1);
+}
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
@@ -1190,7 +1200,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
-	return __set_page_dirty_nobuffers(page);
+	return __set_page_dirty_nobuffers_track_owner(page, 0);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
 
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..78f5425 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -10,14 +10,17 @@
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
+#include <linux/blkio-track.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+#endif
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -75,7 +78,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -84,12 +87,13 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -134,6 +138,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 		 */
 		kmemleak_not_leak(base);
 	} else {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
 		/*
  		 * We don't have to allocate page_cgroup again, but
 		 * address of memmap may be changed. So, we have to initialize
@@ -144,6 +149,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 		/* check address of memmap is changed or not. */
 		if (base->page == pfn_to_page(pfn))
 			return 0;
+#else
+		return 0;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 	}
 
 	if (!base) {
@@ -258,7 +266,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -267,14 +275,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5c8cfab..bd4c4e7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/blkio-track.h>
 
 #include <asm/pgtable.h>
 
@@ -330,6 +331,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/6] Make async queues per cgroup.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
  2011-03-08 21:20 ` [PATCH 1/6] Add IO cgroup tracking " Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 21:20 ` [PATCH 3/6] Modify CFQ to use IO tracking information Justin TerAvest
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

There is currently only one set of async queues. This patch moves them
to a per cgroup data structure. Changes are done to make sure per
cgroup async queue references are dropped when the cgroup goes away.

TESTED:
Verified by creating multiple cgroups that async queues were getting
created properly. Also made sure that the references are getting
dropped and queues getting deallocated properly in the two situations:
- Cgroup goes away first, while IOs are still being done.
- IOs stop getting done and then cgroup goes away.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 block/cfq-iosched.c |   57 +++++++++++++++++++++++++--------------------------
 1 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b68cd2d..1ca9fac 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -202,6 +202,12 @@ struct cfq_group {
 	struct cfq_rb_root service_trees[2][3];
 	struct cfq_rb_root service_tree_idle;
 
+	/*
+	 * async queue for each priority case
+	 */
+	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
+	struct cfq_queue *async_idle_cfqq;
+
 	unsigned long saved_workload_slice;
 	enum wl_type_t saved_workload;
 	enum wl_prio_t saved_serving_prio;
@@ -266,12 +272,6 @@ struct cfq_data {
 	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -449,6 +449,7 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
 				       struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
+static void cfq_put_async_queues(struct cfq_group *cfqg);
 
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    bool is_sync)
@@ -1100,10 +1101,6 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 
 static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
 {
-	/* Currently, all async queues are mapped to root group */
-	if (!cfq_cfqq_sync(cfqq))
-		cfqg = &cfqq->cfqd->root_group;
-
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
 	cfqq->cfqg->ref++;
@@ -1129,6 +1126,7 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
 
 	hlist_del_init(&cfqg->cfqd_node);
+	cfq_put_async_queues(cfqg);
 
 	/*
 	 * Put the reference taken at the time of creation so that when all
@@ -2897,15 +2895,13 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct cfq_group *cfqg,
+		     bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-	struct cfq_group *cfqg;
 
 retry:
-	cfqg = cfq_get_cfqg(cfqd, 1);
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -2949,15 +2945,15 @@ retry:
 }
 
 static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
+		return &cfqg->async_cfqq[0][ioprio];
 	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
+		return &cfqg->async_cfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
+		return &cfqg->async_idle_cfqq;
 	default:
 		BUG();
 	}
@@ -2971,17 +2967,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue **async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
+		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
+						  ioprio);
 		cfqq = *async_cfqq;
 	}
 
 	if (!cfqq)
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, cfqg, is_sync, ioc, gfp_mask);
 
 	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
+	 * pin the queue now that it's allocated, cgroup deletion will prune it
 	 */
 	if (!is_sync && !(*async_cfqq)) {
 		cfqq->ref++;
@@ -3791,19 +3789,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
 	cancel_work_sync(&cfqd->unplug_work);
 }
 
-static void cfq_put_async_queues(struct cfq_data *cfqd)
+static void cfq_put_async_queues(struct cfq_group *cfqg)
 {
 	int i;
 
 	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
+		if (cfqg->async_cfqq[0][i])
+			cfq_put_queue(cfqg->async_cfqq[0][i]);
+		if (cfqg->async_cfqq[1][i])
+			cfq_put_queue(cfqg->async_cfqq[1][i]);
 	}
 
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
+	if (cfqg->async_idle_cfqq)
+		cfq_put_queue(cfqg->async_idle_cfqq);
 }
 
 static void cfq_cfqd_free(struct rcu_head *head)
@@ -3831,8 +3829,9 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
 	cfq_release_cfq_groups(cfqd);
+	/* Release the queues of root group. */
+	cfq_put_async_queues(&cfqd->root_group);
 	cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
 
 	spin_unlock_irq(q->queue_lock);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/6] Modify CFQ to use IO tracking information.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
  2011-03-08 21:20 ` [PATCH 1/6] Add IO cgroup tracking " Justin TerAvest
  2011-03-08 21:20 ` [PATCH 2/6] Make async queues per cgroup Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 21:20 ` [PATCH 4/6] With per-cgroup async, don't special case queues Justin TerAvest
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

IO tracking bits are used to send IOs to the right async
queue. Current task is still used to identify the cgroup of the
synchronous IO. Current task is also used if IO tracking is disabled.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 block/blk-cgroup.c       |    6 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  169 ++++++++++++++++++++++++++++++++++++++++-----
 block/elevator.c         |    5 +-
 include/linux/elevator.h |    6 +-
 5 files changed, 165 insertions(+), 28 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 80d88ec..0f147aa 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -111,6 +111,9 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,
 
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &blkio_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
 			    struct blkio_cgroup, css);
 }
@@ -1537,6 +1540,7 @@ unsigned long get_blkio_cgroup_id(struct bio *bio)
 		id = pc->blkio_cgroup_id;
 	return id;
 }
+EXPORT_SYMBOL(get_blkio_cgroup_id);
 
 /**
  * get_cgroup_from_page() - determine the cgroup from a page.
@@ -1563,8 +1567,6 @@ struct cgroup *get_cgroup_from_page(struct page *page)
 
 	return css->cgroup;
 }
-
-EXPORT_SYMBOL(get_blkio_cgroup_id);
 EXPORT_SYMBOL(get_cgroup_from_page);
 
 #endif /* CONFIG_CGROUP_BLKIOTRACK */
diff --git a/block/blk-core.c b/block/blk-core.c
index 2f4002f..48c1f00 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -671,7 +671,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -683,7 +684,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -824,7 +825,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1ca9fac..75902b1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include <linux/blkio-track.h>
 #include "cfq.h"
 
 /*
@@ -303,6 +304,10 @@ struct cfq_data {
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+				   struct bio *bio, int create);
+static struct cfq_queue **
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);
 
 static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
 					    enum wl_prio_t prio,
@@ -445,7 +450,7 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 }
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
 				       struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
@@ -457,9 +462,55 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[is_sync];
 }
 
+/*
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and iprio_class from task
+ * submitting async bio. Later save the task information in the page_cgroup
+ * and retrieve task's ioprio and class from there.
+ */
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
+{
+	struct cfq_queue *cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct cfq_group *cfqg;
+		struct cfq_queue **async_cfqq;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+		if (!cfqg) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+		return *async_cfqq;
+	}
+#endif
+	return cfqq;
+}
+
 static inline void cic_set_cfqq(struct cfq_io_context *cic,
 				struct cfq_queue *cfqq, bool is_sync)
 {
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
 	cic->cfqq[is_sync] = cfqq;
 }
 
@@ -1028,7 +1079,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 	unsigned int major, minor;
 
 	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
-	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+	if (!bdi || !bdi->dev || !dev_name(bdi->dev))
+		goto done;
+	if (cfqg && !cfqg->blkg.dev) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		cfqg->blkg.dev = MKDEV(major, minor);
 		goto done;
@@ -1079,16 +1132,28 @@ done:
  * Search for the cfq group current task belongs to. If create = 1, then also
  * create the cfq group if it does not exist. request_queue lock must be held.
  */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
+				      int create)
 {
 	struct cgroup *cgroup;
 	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, blkio_subsys_id);
+
+	if (!page)
+		cgroup = task_cgroup(current, blkio_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
+		cfqg = &cfqd->root_group;
+		goto out;
+	}
+
 	cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
 	if (!cfqg && create)
 		cfqg = &cfqd->root_group;
+out:
 	rcu_read_unlock();
 	return cfqg;
 }
@@ -1099,6 +1164,32 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 	return cfqg;
 }
 
+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+					struct bio *bio, int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group
+	 */
+	if (!bio)
+		goto sync;
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	/* Map the sync bio to the right group using task context */
+	if (cfq_bio_sync(bio))
+		goto sync;
+
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+#endif
+
+sync:
+	return cfq_get_cfqg(cfqd, page, create);
+}
+
 static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
 {
 	cfqq->cfqg = cfqg;
@@ -1176,7 +1267,8 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 }
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+					  struct bio *bio, int create)
 {
 	return &cfqd->root_group;
 }
@@ -1481,7 +1573,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -1605,7 +1697,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return false;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
 	return cfqq == RQ_CFQQ(rq);
 }
 
@@ -2816,14 +2908,10 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
-		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
-						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -2863,6 +2951,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 {
 	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	unsigned long flags;
 	struct request_queue *q;
@@ -2884,6 +2973,12 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 		cfq_put_queue(sync_cfqq);
 	}
 
+	if (async_cfqq != NULL) {
+		cfq_log_cfqq(cfqd, async_cfqq, "changed cgroup");
+		cic_set_cfqq(cic, NULL, 0);
+		cfq_put_queue(async_cfqq);
+	}
+
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
@@ -2906,6 +3001,24 @@ retry:
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct cfq_queue **async_cfqq;
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+		cfqq = *async_cfqq;
+	}
+#endif
+
 	/*
 	 * Always try a new alloc if we fell back to the OOM cfqq
 	 * originally, since it should just be a temporary situation.
@@ -2960,14 +3073,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
+	      struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue **async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);
+	struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -2986,7 +3099,24 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 		*async_cfqq = cfqq;
 	}
 
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+	/*
+	 * ioc reference. If async request queue/group is determined from the
+	 * original task/cgroup and not from submitter task, io context can
+	 * not cache the pointer to async queue and everytime a request comes,
+	 * it will be determined by going through the async queue array.
+	 *
+	 */
+	if (is_sync)
+		cfqq->ref++;
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	cfqq->ref++;
+#endif
 	return cfqq;
 }
 
@@ -3652,7 +3782,8 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -3673,7 +3804,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	} else {
 		/*
diff --git a/block/elevator.c b/block/elevator.c
index 2569512..9854cf6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -752,12 +752,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4d85797..496c182 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -25,7 +25,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -119,7 +120,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/6] With per-cgroup async, don't special case queues.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
                   ` (2 preceding siblings ...)
  2011-03-08 21:20 ` [PATCH 3/6] Modify CFQ to use IO tracking information Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 21:20 ` [PATCH 5/6] Add stat for per cgroup writeout done by flusher Justin TerAvest
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

This reverts effectively reverts commit f26bd1f0a3a31bc5e16d285f5e1b00a56abf6238
"blkio: Determine async workload length based on total number of queues"
in the case when async IO tracking is enabled.

That commit was used to work around the fact that async
queues were part of root cgroup. That is no longer the case when we have
async write tracking enabed.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 block/cfq-iosched.c |   90 +++++++++++++++++++++++++++++++-------------------
 1 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 75902b1..ef01dd8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -442,13 +442,6 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
 		+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
 }
 
-static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
-					struct cfq_group *cfqg)
-{
-	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
-}
-
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
 				       struct io_context *, gfp_t);
@@ -1011,27 +1004,50 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
 	return slice_used;
 }
 
-static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
-				struct cfq_queue *cfqq)
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+static inline int cfqg_busy_async_queues(struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-	unsigned int used_sl, charge;
-	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
+	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
+		+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+}
+#endif
+
+static void cfq_charge_slice(struct cfq_rb_root *st, struct cfq_group *cfqg,
+				struct cfq_queue *cfqq, unsigned int used_sl)
+{
+	struct cfq_data *cfqd = cfqq->cfqd;
+	unsigned int charge = used_sl;
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqg)
 			- cfqg->service_tree_idle.count;
 
 	BUG_ON(nr_sync < 0);
-	used_sl = charge = cfq_cfqq_slice_usage(cfqq);
 
 	if (iops_mode(cfqd))
 		charge = cfqq->slice_dispatch;
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;
+#endif
 
 	/* Can't update vdisktime while group is on service tree */
 	cfq_group_service_tree_del(st, cfqg);
 	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
 	/* If a new weight was requested, update now, off tree */
+	cfq_update_group_weight(cfqg);
 	cfq_group_service_tree_add(st, cfqg);
+	cfq_log_cfqq(cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+			" sect=%u", used_sl, cfqq->slice_dispatch, charge,
+			iops_mode(cfqd), cfqq->nr_sectors);
+}
+
+static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
+				struct cfq_queue *cfqq)
+{
+	struct cfq_rb_root *st = &cfqd->grp_service_tree;
+	unsigned int used_sl;
+
+	used_sl = cfq_cfqq_slice_usage(cfqq);
+	cfq_charge_slice(st, cfqg, cfqq, used_sl);
 
 	/* This group is being expired. Save the context */
 	if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1044,9 +1060,6 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 
 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
-	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
-			" sect=%u", used_sl, cfqq->slice_dispatch, charge,
-			iops_mode(cfqd), cfqq->nr_sectors);
 	cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
 	cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
 }
@@ -2198,6 +2211,30 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 	return cur_best;
 }
 
+static unsigned cfq_async_slice(struct cfq_data *cfqd, struct cfq_group *cfqg,
+				unsigned slice)
+{
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+	unsigned tmp;
+	/*
+	 * Async queues are currently system wide. Just taking
+	 * proportion of queues with-in same group will lead to higher
+	 * async ratio system wide as generally root group is going
+	 * to have higher weight. A more accurate thing would be to
+	 * calculate system wide asnc/sync ratio.
+	 */
+	tmp = cfq_target_latency * cfqg_busy_async_queues(cfqg);
+	tmp = tmp/cfqd->busy_queues;
+	slice = min_t(unsigned, slice, tmp);
+#endif
+	/*
+	 * async workload slice is scaled down according to
+	 * the sync/async slice ratio.
+	 */
+	slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
+	return slice;
+}
+
 static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	unsigned slice;
@@ -2252,24 +2289,9 @@ new_workload:
 		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
 		      cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
 
-	if (cfqd->serving_type == ASYNC_WORKLOAD) {
-		unsigned int tmp;
-
-		/*
-		 * Async queues are currently system wide. Just taking
-		 * proportion of queues with-in same group will lead to higher
-		 * async ratio system wide as generally root group is going
-		 * to have higher weight. A more accurate thing would be to
-		 * calculate system wide asnc/sync ratio.
-		 */
-		tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
-		tmp = tmp/cfqd->busy_queues;
-		slice = min_t(unsigned, slice, tmp);
-
-		/* async workload slice is scaled down according to
-		 * the sync/async slice ratio. */
-		slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
-	} else
+	if (cfqd->serving_type == ASYNC_WORKLOAD)
+		slice = cfq_async_slice(cfqd, cfqg, slice);
+	else
 		/* sync workload slice is at least 2 * cfq_slice_idle */
 		slice = max(slice, 2 * cfqd->cfq_slice_idle);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
                   ` (3 preceding siblings ...)
  2011-03-08 21:20 ` [PATCH 4/6] With per-cgroup async, don't special case queues Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 21:20 ` [PATCH 6/6] Per cgroup request descriptor counts Justin TerAvest
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

Tracking for buffered writes can detect when traffic comes from a
flusher thread, as opposed to directly from an application. This adds a
statistic to track I/O traffic from flusher threads.

This helps determine whether a flusher thread is being unfair to a
particular cgroup, and if cgroup-based isolation of writeback behavior
is useful.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 block/blk-cgroup.c        |   18 ++++++++++++++++-
 block/blk-cgroup.h        |    9 ++++++-
 block/cfq-iosched.c       |   47 ++++++++++++++++++++++++++------------------
 block/cfq.h               |    6 +++-
 include/linux/blk_types.h |    2 +
 5 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0f147aa..93d2a08 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -410,7 +410,8 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);
 
 void blkiocg_update_completion_stats(struct blkio_group *blkg,
-	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+	bool out_of_ctx)
 {
 	struct blkio_group_stats *stats;
 	unsigned long flags;
@@ -424,6 +425,8 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
 	if (time_after64(io_start_time, start_time))
 		blkio_add_stat(stats->stat_arr[BLKIO_STAT_WAIT_TIME],
 				io_start_time - start_time, direction, sync);
+	if (out_of_ctx)
+		blkg->stats.oo_ctx_io_count++;
 	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
@@ -615,6 +618,9 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
 					blkg->stats.sectors, cb, dev);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	if (type == BLKIO_STAT_OO_CTX_IO_COUNT)
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+					blkg->stats.oo_ctx_io_count, cb, dev);
 	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
 		uint64_t sum = blkg->stats.avg_queue_size_sum;
 		uint64_t samples = blkg->stats.avg_queue_size_samples;
@@ -1151,6 +1157,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 		case BLKIO_PROP_empty_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
 						BLKIO_STAT_EMPTY_TIME, 0);
+		case BLKIO_PROP_oo_ctx_io_count:
+			return blkio_read_blkg_stats(blkcg, cft, cb,
+						BLKIO_STAT_OO_CTX_IO_COUNT, 0);
+
 #endif
 		default:
 			BUG();
@@ -1405,6 +1415,12 @@ struct cftype blkio_files[] = {
 				BLKIO_PROP_dequeue),
 		.read_map = blkiocg_file_read_map,
 	},
+	{
+		.name = "oo_ctx_io_count",
+		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+				BLKIO_PROP_oo_ctx_io_count),
+		.read_map = blkiocg_file_read_map,
+	},
 #endif
 };
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ea4861b..bdb1b73 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -50,6 +50,7 @@ enum stat_type {
 	BLKIO_STAT_TIME,
 	BLKIO_STAT_SECTORS,
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	BLKIO_STAT_OO_CTX_IO_COUNT,
 	BLKIO_STAT_AVG_QUEUE_SIZE,
 	BLKIO_STAT_IDLE_TIME,
 	BLKIO_STAT_EMPTY_TIME,
@@ -90,6 +91,7 @@ enum blkcg_file_name_prop {
 	BLKIO_PROP_idle_time,
 	BLKIO_PROP_empty_time,
 	BLKIO_PROP_dequeue,
+	BLKIO_PROP_oo_ctx_io_count,
 };
 
 /* cgroup files owned by throttle policy */
@@ -114,6 +116,8 @@ struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
 	uint64_t sectors;
+	/* Number of IOs sumbitted out of process context */
+	uint64_t oo_ctx_io_count;
 	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* Sum of number of IOs queued across all samples */
@@ -297,7 +301,8 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
 						bool direction, bool sync);
 void blkiocg_update_completion_stats(struct blkio_group *blkg,
-	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
+	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+	bool out_of_ctx);
 void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
 					bool sync);
 void blkiocg_update_io_add_stats(struct blkio_group *blkg,
@@ -324,7 +329,7 @@ static inline void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 				uint64_t bytes, bool direction, bool sync) {}
 static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
 		uint64_t start_time, uint64_t io_start_time, bool direction,
-		bool sync) {}
+						bool sync, bool out_of_ctx) {}
 static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
 						bool direction, bool sync) {}
 static inline void blkiocg_update_io_add_stats(struct blkio_group *blkg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ef01dd8..fa5a34d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -305,7 +305,7 @@ struct cfq_data {
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
-				   struct bio *bio, int create);
+			struct bio *bio, int *is_oo_ctx, int create);
 static struct cfq_queue **
 cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);
 
@@ -443,8 +443,8 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
 }
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*,
+		int *is_oo_ctx, bool, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 static void cfq_put_async_queues(struct cfq_group *cfqg);
@@ -478,7 +478,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
 		 * async bio tracking is enabled and we are not caching
 		 * async queue pointer in cic.
 		 */
-		cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+		cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 0);
 		if (!cfqg) {
 			/*
 			 * May be this is first rq/bio and io group has not
@@ -1146,17 +1146,21 @@ done:
  * create the cfq group if it does not exist. request_queue lock must be held.
  */
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
-				      int create)
+				      int *is_oo_ctx, int create)
 {
-	struct cgroup *cgroup;
+	struct cgroup *cgroup, *tracked_cgroup;
 	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
 
-	if (!page)
-		cgroup = task_cgroup(current, blkio_subsys_id);
-	else
-		cgroup = get_cgroup_from_page(page);
+	cgroup = task_cgroup(current, blkio_subsys_id);
+	if (page) {
+		tracked_cgroup = get_cgroup_from_page(page);
+		if (is_oo_ctx)
+			*is_oo_ctx = cgroup && tracked_cgroup &&
+				tracked_cgroup != cgroup;
+		cgroup = tracked_cgroup;
+	}
 
 	if (!cgroup) {
 		cfqg = &cfqd->root_group;
@@ -1177,8 +1181,8 @@ static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 	return cfqg;
 }
 
-struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
-					struct bio *bio, int create)
+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd, struct bio *bio,
+					int *is_oo_ctx, int create)
 {
 	struct page *page = NULL;
 
@@ -1200,7 +1204,7 @@ struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
 #endif
 
 sync:
-	return cfq_get_cfqg(cfqd, page, create);
+	return cfq_get_cfqg(cfqd, page, is_oo_ctx, create);
 }
 
 static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
@@ -1281,7 +1285,7 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 
 #else /* GROUP_IOSCHED */
 static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
-					  struct bio *bio, int create)
+			struct bio *bio, int *is_oo_ctx, int create)
 {
 	return &cfqd->root_group;
 }
@@ -3095,14 +3099,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
-	      struct io_context *ioc, gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
+	      bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue **async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);
+	struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);
 
 	if (!is_sync) {
 		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3625,7 +3629,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	(RQ_CFQG(rq))->dispatched--;
 	cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
 			rq_start_time_ns(rq), rq_io_start_time_ns(rq),
-			rq_data_dir(rq), rq_is_sync(rq));
+			rq_data_dir(rq), rq_is_sync(rq),
+			rq->cmd_flags & REQ_OUT_OF_CTX);
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 
@@ -3813,6 +3818,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	int is_oo_ctx = 0;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -3826,8 +3832,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, &is_oo_ctx, is_sync, cic->ioc,
+					gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
+		if (is_oo_ctx)
+			rq->cmd_flags |= REQ_OUT_OF_CTX;
 	} else {
 		/*
 		 * If the queue was seeky for too long, break it apart.
diff --git a/block/cfq.h b/block/cfq.h
index 54a6d90..23410eb 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -61,10 +61,12 @@ static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 	blkiocg_update_dispatch_stats(blkg, bytes, direction, sync);
 }
 
-static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
+				uint64_t start_time, uint64_t io_start_time,
+				bool direction, bool sync, bool out_of_ctx)
 {
 	blkiocg_update_completion_stats(blkg, start_time, io_start_time,
-				direction, sync);
+				direction, sync, out_of_ctx);
 }
 
 static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..eb25b06 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -151,6 +151,7 @@ enum rq_flag_bits {
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
+	__REQ_OUT_OF_CTX,	/* request submitted out of process context */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -191,5 +192,6 @@ enum rq_flag_bits {
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
+#define REQ_OUT_OF_CTX		(1 << __REQ_OUT_OF_CTX)
 
 #endif /* __LINUX_BLK_TYPES_H */
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 6/6] Per cgroup request descriptor counts
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
                   ` (4 preceding siblings ...)
  2011-03-08 21:20 ` [PATCH 5/6] Add stat for per cgroup writeout done by flusher Justin TerAvest
@ 2011-03-08 21:20 ` Justin TerAvest
  2011-03-08 22:43 ` [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Vivek Goyal
  2011-03-09  5:22 ` KAMEZAWA Hiroyuki
  7 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-08 21:20 UTC (permalink / raw)
  To: m-ikeda, jaxboe, vgoyal
  Cc: linux-kernel, ryov, taka, kamezawa.hiroyu, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, Justin TerAvest

If a cgroup gets starved, it keeps allocating request
descriptors until it hits the limit. Other cgroups could be
starved of request descriptors, which is a problem.
This patch implements per-cgroup request descriptor limits.

There has been discussion to remove limits entirely, but they are
introduced here to enable tracking for buffered writes without causing
problems for the per request-queue limit.

There are some interesting corner cases that are also covered.
During elevator switch, we start counting the request descriptors
towards the request list of the root group. We drain all the requests
that were started before the switch. If new requests arrive after
the switch is progressing, they are counted towards the common, shared
request list of the root group. As these requests complete, the counter
in the common request list is decremented properly.

Signed-off-by: Justin TerAvest <teravest@google.com>
---
 Documentation/block/biodoc.txt |   10 ++
 block/blk-core.c               |  209 +++++++++++++++++++++++++++-------------
 block/blk-settings.c           |    2 +-
 block/blk-sysfs.c              |   60 +++++++-----
 block/cfq-iosched.c            |   65 ++++++++++++-
 block/elevator.c               |    6 +-
 include/linux/blkdev.h         |   81 +++++++++++++++-
 include/linux/elevator.h       |    8 ++
 8 files changed, 336 insertions(+), 105 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index b9a83dd..151458d 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -975,6 +975,16 @@ elevator_latter_req_fn		These return the request before or after the
 
 elevator_completed_req_fn	called when a request is completed.
 
+elevator_req_list_fn		called to obtain the active request list for
+				either a bio or a request. This function can
+				be called with either a valid bio or request.
+				This function must be defined only if the
+				scheduler supports per cgroup request lists.
+				When defined, it becomes the responsibility
+				of the scheduler to call blk_alloced_request
+				after the request and queue are allocated and
+				the cgroup is declared.
+
 elevator_may_queue_fn		returns true if the scheduler wants to allow the
 				current context to queue a new request even if
 				it is over the queue limit. This must be used
diff --git a/block/blk-core.c b/block/blk-core.c
index 48c1f00..d90b05c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -478,23 +478,28 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (unlikely(rl->rq_pool))
-		return 0;
-
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
 	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
+
+static int blk_init_free_list(struct request_queue *q)
+{
+	if (unlikely(q->rq_pool.rq_pool))
+		return 0;
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+	blk_init_request_list(&q->rq);
 
-	if (!rl->rq_pool)
+	q->rq_pool.count[BLK_RW_SYNC] = 0;
+	q->rq_pool.count[BLK_RW_ASYNC] = 0;
+	q->rq_pool.elvpriv = 0;
+	q->rq_pool.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+					mempool_alloc_slab, mempool_free_slab,
+					request_cachep, q->node);
+	if (!q->rq_pool.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -667,14 +672,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_pool.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_pool.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -684,11 +689,11 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
+		rq->cmd_flags |= REQ_ELVPRIV;
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_pool.rq_pool);
 			return NULL;
 		}
-		rq->cmd_flags |= REQ_ELVPRIV;
 	}
 
 	return rq;
@@ -728,49 +733,78 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+			    struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_pool.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_pool.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
 /*
  * A request has just been released.  Account for it, update the full and
- * congestion status, wake up any waiters.   Called under q->queue_lock.
+ * congestion status, wake up any waiters.  Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+			  struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
+	if (priv) {
+		q->rq_pool.elvpriv--;
+		BUG_ON(!rl->count[sync]);
+		rl->count[sync]--;
+	}
 
-	rl->count[sync]--;
-	if (priv)
-		rl->elvpriv--;
+	BUG_ON(!q->rq_pool.count[sync]);
+	q->rq_pool.count[sync]--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
 }
 
+void blk_alloced_request(struct request_queue *q, struct request_list *rl,
+			 int sync)
+{
+	int priv;
+
+	q->rq_pool.count[sync]++;
+	rl->starved[sync] = 0;
+
+	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+	if (priv) {
+		q->rq_pool.elvpriv++;
+		/*
+		 * Account the request to request list only if request is
+		 * going to elevator. During elevator switch, there will
+		 * be small window where group is going away and new group
+		 * will not be allocated till elevator switch is complete.
+		 * So till then instead of slowing down the application,
+		 * we will continue to allocate request from total common
+		 * pool instead of per group limit
+		 */
+		rl->count[sync]++;
+	}
+}
+EXPORT_SYMBOL(blk_alloced_request);
+
 /*
  * Get a free request, queue_lock must be held.
  * Returns NULL on failure, with queue_lock held.
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+				   struct bio *bio, gfp_t gfp_mask,
+				   struct request_list *rl)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
@@ -779,31 +813,41 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
+	if (q->rq_pool.count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
+		blk_add_trace_msg(q, "Queue congested: setting congested flag");
+		blk_set_queue_congested(q, is_sync);
+	}
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_pool.count[is_sync]+1 >= q->nr_requests) {
+		blk_add_trace_msg(q, "Queue congested: setting full flag");
+		blk_set_queue_full(q, is_sync);
+	}
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else if (may_queue != ELV_MQUEUE_MUST
+				&& !ioc_batching(q, ioc)) {
 			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
+			 * The queue is full and the allocating
+			 * process is not a "batcher", and not
+			 * exempted by the IO scheduler
 			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
-			}
+			blk_add_trace_msg(q, "Queue congested: "
+						"not allocating request");
+			goto out;
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -811,15 +855,19 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2)) {
+		blk_add_trace_msg(q, "50 percent over limit, not allocating");
 		goto out;
-
-	rl->count[is_sync]++;
-	rl->starved[is_sync] = 0;
+	}
 
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * If the scheduler supports per cgroup request lists it will call
+	 * blk_alloced_request after the request and queue is allocated and
+	 * the cgroup has been decided.
+	 */
+	if (!blk_supports_cgroups(q) || !priv)
+		blk_alloced_request(q, rl, is_sync);
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
@@ -835,7 +883,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		if (!blk_supports_cgroups(q) || !priv)
+			freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -845,9 +894,11 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
+		if (unlikely(rl->count[is_sync] == 0)) {
+			blk_add_trace_msg(q, "Queue congested: "
+				"marking %d starved", is_sync);
 			rl->starved[is_sync] = 1;
-
+		}
 		goto out;
 	}
 
@@ -876,16 +927,23 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
+		/*
+		 * We are about to sleep on a request list and we
+		 * drop queue lock. After waking up, we will do
+		 * finish_wait() on request list and in the mean
+		 * time group might be gone. Take a reference to
+		 * the group now.
+		 */
 		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
 				TASK_UNINTERRUPTIBLE);
-
+		blk_get_rl_group(q, rl);
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
 		__generic_unplug_device(q);
@@ -904,7 +962,18 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		/*
+		 * We had taken a reference to the request list goup.
+		 * Put that now
+		 */
+		blk_put_rl_group(q, rl);
+
+		/*
+		 * After the sleep check the rl again in case the group the bio
+		 * belonged to is gone and it is mapped to root group now
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	};
 
 	return rq;
@@ -913,6 +982,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
@@ -920,7 +990,8 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rl = blk_get_request_list(q, NULL);
+		rq = get_request(q, rw, NULL, gfp_mask, rl);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1121,12 +1192,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
+		BUG_ON(!rl);
 
+		freed_request(q, is_sync, priv, rl);
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 36c8c1f..8b18e08 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -158,7 +158,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
-
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
 	blk_queue_congestion_threshold(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 41fb691..2cd2618 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -39,7 +39,6 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
 	unsigned long nr;
 	int ret;
 
@@ -53,33 +52,31 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	spin_lock_irq(q->queue_lock);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_SYNC);
-
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_ASYNC);
-
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
-		blk_clear_queue_full(q, BLK_RW_SYNC);
-		wake_up(&rl->wait[BLK_RW_SYNC]);
-	}
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, page);
+}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
-		blk_clear_queue_full(q, BLK_RW_ASYNC);
-		wake_up(&rl->wait[BLK_RW_ASYNC]);
-	}
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+				size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -271,6 +268,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -381,6 +386,9 @@ static struct queue_sysfs_entry queue_random_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -467,14 +475,12 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
-
 	blk_throtl_exit(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_pool.rq_pool)
+		mempool_destroy(q->rq_pool.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fa5a34d..ab7a216 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -214,6 +214,9 @@ struct cfq_group {
 	enum wl_prio_t saved_serving_prio;
 	struct blkio_group blkg;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+	/* Request list associated with the group */
+	struct request_list rl;
+
 	struct hlist_node cfqd_node;
 	int ref;
 #endif
@@ -1133,6 +1136,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 					0);
 
 	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+	blk_init_request_list(&cfqg->rl);
 
 	/* Add group on cfqd list */
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -1225,6 +1229,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+	BUG_ON(cfqg->rl.count[0] | cfqg->rl.count[1]);
 	kfree(cfqg);
 }
 
@@ -3106,7 +3111,13 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue **async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);
+	/* If cfqg has not been allocated during request list allocation,
+	   do not create it now. Otherwise, request list counter could get
+	   off-by one errors.
+	 */
+	struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 0);
+	if (!cfqg)
+		cfqg = &cfqd->root_group;
 
 	if (!is_sync) {
 		async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3543,6 +3554,46 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+/*
+ * This function is essentially overloaded to return a request list
+ * given either a bio or a request. Only one of these is expected to
+ * be provided on any given invocation. The other must be NULL.
+ */
+static struct request_list *cfq_request_list(struct request_queue *q,
+					     struct bio *bio,
+					     struct request *rq)
+{
+	if (rq) {
+		struct cfq_queue *cfqq = RQ_CFQQ(rq);
+		return &cfqq->cfqg->rl;
+	} else {
+		struct cfq_data *cfqd = q->elevator->elevator_data;
+		struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 1);
+
+		if (!cfqg)
+			cfqg = &cfqd->root_group;
+		return &cfqg->rl;
+	}
+}
+
+#define RL_CFQG(rl) container_of((rl), struct cfq_group, rl)
+
+static void cfq_get_rl_group(struct request_list *rl)
+{
+	struct cfq_group *cfqg = RL_CFQG(rl);
+
+	cfqg->ref++;
+}
+
+static void cfq_put_rl_group(struct request_list *rl)
+{
+	struct cfq_group *cfqg = RL_CFQG(rl);
+
+	cfq_put_cfqg(cfqg);
+}
+#endif
+
 /*
  * Update hw_tag based on peak queue depth over 50 samples under
  * sufficient load.
@@ -3860,9 +3911,11 @@ new_queue:
 
 	cfqq->allocated[rw]++;
 	cfqq->ref++;
+
 	rq->elevator_private = cic;
 	rq->elevator_private2 = cfqq;
 	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+	blk_alloced_request(q, rq_rl(q, rq), is_sync != 0);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
@@ -4066,6 +4119,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	 */
 	cfqg->ref = 1;
 	rcu_read_lock();
+	blk_init_request_list(&cfqg->rl);
 	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
 					(void *)cfqd, 0);
 	rcu_read_unlock();
@@ -4108,7 +4162,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_group_idle = cfq_group_idle;
 	cfqd->cfq_latency = 1;
-	cfqd->cfq_group_isolation = 0;
 	cfqd->hw_tag = -1;
 	/*
 	 * we optimistically start assuming sync ops weren't delayed in last
@@ -4184,7 +4237,6 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
-SHOW_FUNCTION(cfq_group_isolation_show, cfqd->cfq_group_isolation, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -4218,7 +4270,6 @@ STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
-STORE_FUNCTION(cfq_group_isolation_store, &cfqd->cfq_group_isolation, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -4236,7 +4287,6 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(group_idle),
 	CFQ_ATTR(low_latency),
-	CFQ_ATTR(group_isolation),
 	__ATTR_NULL
 };
 
@@ -4255,6 +4305,11 @@ static struct elevator_type iosched_cfq = {
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+		.elevator_req_list_fn =		cfq_request_list,
+		.elevator_get_rl_group_fn =	cfq_get_rl_group,
+		.elevator_put_rl_group_fn =	cfq_put_rl_group,
+#endif
 		.elevator_set_req_fn =		cfq_set_request,
 		.elevator_put_req_fn =		cfq_put_request,
 		.elevator_may_queue_fn =	cfq_may_queue,
diff --git a/block/elevator.c b/block/elevator.c
index 9854cf6..64a823c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_pool.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -680,8 +680,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_pool.count[BLK_RW_SYNC] +
+			  q->rq_pool.count[BLK_RW_ASYNC] - queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..77c66a1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,7 +32,18 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	128	/* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ	128	/* Default per group maximum */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is eqivalent to case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
@@ -44,9 +55,21 @@ struct request_list {
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_pool {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count.
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
 };
 
 /*
@@ -269,6 +292,11 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/*
+	 * Contains request pool and other data like overall request count
+	 */
+	struct request_pool	rq_pool;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -329,6 +357,7 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -606,6 +635,53 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
 }
 #endif /* CONFIG_MMU */
 
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+							struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+	int priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+
+	if (priv && e->ops->elevator_req_list_fn)
+		return e->ops->elevator_req_list_fn(q, bio, NULL);
+	return &q->rq;
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+					 struct request *rq)
+{
+	struct elevator_queue *e = q->elevator;
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	if (priv && e->ops->elevator_req_list_fn)
+		return e->ops->elevator_req_list_fn(q, NULL, rq);
+	return &q->rq;
+}
+
+static inline void blk_get_rl_group(struct request_queue *q,
+				    struct request_list *rl)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e->ops->elevator_get_rl_group_fn)
+		e->ops->elevator_get_rl_group_fn(rl);
+}
+
+static inline void blk_put_rl_group(struct request_queue *q,
+				    struct request_list *rl)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e->ops->elevator_put_rl_group_fn)
+		e->ops->elevator_put_rl_group_fn(rl);
+}
+
+static inline int blk_supports_cgroups(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+
+	return (e->ops->elevator_req_list_fn) ? 1 : 0;
+}
+
 struct rq_map_data {
 	struct page **pages;
 	int page_order;
@@ -805,6 +881,7 @@ extern struct request_queue *blk_init_allocated_queue_node(struct request_queue
 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
 extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
 						      request_fn_proc *, spinlock_t *);
+extern void blk_init_request_list(struct request_list *rl);
 extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
@@ -857,6 +934,8 @@ int blk_get_queue(struct request_queue *);
 struct request_queue *blk_alloc_queue(gfp_t);
 struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
+extern void blk_alloced_request(struct request_queue *q,
+				struct request_list *rl, int sync);
 
 /*
  * tag stuff
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 496c182..f0bc28d 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,6 +22,10 @@ typedef int (elevator_dispatch_fn) (struct request_queue *, int);
 typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_queue_empty_fn) (struct request_queue *);
 typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
+typedef struct request_list *(elevator_req_list_fn)
+		(struct request_queue *, struct bio *, struct request *);
+typedef void (elevator_get_rl_group_fn)(struct request_list *);
+typedef void (elevator_put_rl_group_fn)(struct request_list *);
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
@@ -52,10 +56,14 @@ struct elevator_ops
 
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
+	elevator_req_list_fn *elevator_req_list_fn;
 
 	elevator_set_req_fn *elevator_set_req_fn;
 	elevator_put_req_fn *elevator_put_req_fn;
 
+	elevator_get_rl_group_fn *elevator_get_rl_group_fn;
+	elevator_put_rl_group_fn *elevator_put_rl_group_fn;
+
 	elevator_may_queue_fn *elevator_may_queue_fn;
 
 	elevator_init_fn *elevator_init_fn;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
                   ` (5 preceding siblings ...)
  2011-03-08 21:20 ` [PATCH 6/6] Per cgroup request descriptor counts Justin TerAvest
@ 2011-03-08 22:43 ` Vivek Goyal
  2011-03-08 22:50   ` Vivek Goyal
  2011-03-09  5:22 ` KAMEZAWA Hiroyuki
  7 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-08 22:43 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
> This patchset adds tracking to the page_cgroup structure for which cgroup has
> dirtied a page, and uses that information to provide isolation between
> cgroups performing writeback.
> 

Justin,

So if somebody is trying to isolate a workload which does bunch of READS
and lots of buffered WRITES, this patchset should help in the sense that
all the heavy WRITES can be put into a separate cgroup of low weight?

Other application which are primarily doing READS, direct WRITES or little
bit of buffered WRITES should still get good latencies if heavy writer
is isolated in a separate group?

If yes, then this piece standalone can make sense. And once the other
piece/patches of memory cgroup dirty ratio and cgroup aware buffered
writeout come in, then one will be able to differentiate buffered writes
of different groups.

Thanks
Vivek

> I know that there is some discussion to remove request descriptor limits
> entirely, but I included a patch to introduce per-cgroup limits to enable
> this functionality. Without it, we didn't see much isolation improvement.
> 
> I think most of this material has been discussed on lkml previously, this is
> just another attempt to make a patchset that handles buffered writes for CFQ.
> 
> There was a lot of previous discussion at:
>   http://thread.gmane.org/gmane.linux.kernel/1007922
> 
> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
> and Vivek Goyal for work on previous versions of these patches.
> 
> 
>  Documentation/block/biodoc.txt |   10 +
>  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
>  block/blk-cgroup.h             |    9 +-
>  block/blk-core.c               |  216 +++++++++++++++--------
>  block/blk-settings.c           |    2 +-
>  block/blk-sysfs.c              |   60 ++++---
>  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
>  block/cfq.h                    |    6 +-
>  block/elevator.c               |   11 +-
>  fs/buffer.c                    |    2 +
>  fs/direct-io.c                 |    2 +
>  include/linux/blk_types.h      |    2 +
>  include/linux/blkdev.h         |   81 ++++++++-
>  include/linux/blkio-track.h    |   89 +++++++++
>  include/linux/elevator.h       |   14 ++-
>  include/linux/iocontext.h      |    1 +
>  include/linux/memcontrol.h     |    6 +
>  include/linux/mmzone.h         |    4 +-
>  include/linux/page_cgroup.h    |   12 +-
>  init/Kconfig                   |   16 ++
>  mm/Makefile                    |    3 +-
>  mm/bounce.c                    |    2 +
>  mm/filemap.c                   |    2 +
>  mm/memcontrol.c                |    6 +
>  mm/memory.c                    |    6 +
>  mm/page-writeback.c            |   14 ++-
>  mm/page_cgroup.c               |   29 ++-
>  mm/swap_state.c                |    2 +
>  28 files changed, 985 insertions(+), 216 deletions(-)
> 
> [PATCH 1/6] Add IO cgroup tracking for buffered writes.
> [PATCH 2/6] Make async queues per cgroup.
> [PATCH 3/6] Modify CFQ to use IO tracking information.
> [PATCH 4/6] With per-cgroup async, don't special case queues.
> [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
> [PATCH 6/6] Per cgroup request descriptor counts

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-08 22:43 ` [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Vivek Goyal
@ 2011-03-08 22:50   ` Vivek Goyal
  2011-03-09 18:04     ` Justin TerAvest
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-08 22:50 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
> > This patchset adds tracking to the page_cgroup structure for which cgroup has
> > dirtied a page, and uses that information to provide isolation between
> > cgroups performing writeback.
> > 
> 
> Justin,
> 
> So if somebody is trying to isolate a workload which does bunch of READS
> and lots of buffered WRITES, this patchset should help in the sense that
> all the heavy WRITES can be put into a separate cgroup of low weight?
> 
> Other application which are primarily doing READS, direct WRITES or little
> bit of buffered WRITES should still get good latencies if heavy writer
> is isolated in a separate group?
> 
> If yes, then this piece standalone can make sense. And once the other
> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
> writeout come in, then one will be able to differentiate buffered writes
> of different groups.

Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
question would be will it help me enable get better isolation latencies
of READS agains buffered WRITES?

Thanks
Vivek 

> 
> Thanks
> Vivek
> 
> > I know that there is some discussion to remove request descriptor limits
> > entirely, but I included a patch to introduce per-cgroup limits to enable
> > this functionality. Without it, we didn't see much isolation improvement.
> > 
> > I think most of this material has been discussed on lkml previously, this is
> > just another attempt to make a patchset that handles buffered writes for CFQ.
> > 
> > There was a lot of previous discussion at:
> >   http://thread.gmane.org/gmane.linux.kernel/1007922
> > 
> > Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
> > and Vivek Goyal for work on previous versions of these patches.
> > 
> > 
> >  Documentation/block/biodoc.txt |   10 +
> >  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
> >  block/blk-cgroup.h             |    9 +-
> >  block/blk-core.c               |  216 +++++++++++++++--------
> >  block/blk-settings.c           |    2 +-
> >  block/blk-sysfs.c              |   60 ++++---
> >  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
> >  block/cfq.h                    |    6 +-
> >  block/elevator.c               |   11 +-
> >  fs/buffer.c                    |    2 +
> >  fs/direct-io.c                 |    2 +
> >  include/linux/blk_types.h      |    2 +
> >  include/linux/blkdev.h         |   81 ++++++++-
> >  include/linux/blkio-track.h    |   89 +++++++++
> >  include/linux/elevator.h       |   14 ++-
> >  include/linux/iocontext.h      |    1 +
> >  include/linux/memcontrol.h     |    6 +
> >  include/linux/mmzone.h         |    4 +-
> >  include/linux/page_cgroup.h    |   12 +-
> >  init/Kconfig                   |   16 ++
> >  mm/Makefile                    |    3 +-
> >  mm/bounce.c                    |    2 +
> >  mm/filemap.c                   |    2 +
> >  mm/memcontrol.c                |    6 +
> >  mm/memory.c                    |    6 +
> >  mm/page-writeback.c            |   14 ++-
> >  mm/page_cgroup.c               |   29 ++-
> >  mm/swap_state.c                |    2 +
> >  28 files changed, 985 insertions(+), 216 deletions(-)
> > 
> > [PATCH 1/6] Add IO cgroup tracking for buffered writes.
> > [PATCH 2/6] Make async queues per cgroup.
> > [PATCH 3/6] Modify CFQ to use IO tracking information.
> > [PATCH 4/6] With per-cgroup async, don't special case queues.
> > [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
> > [PATCH 6/6] Per cgroup request descriptor counts

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
                   ` (6 preceding siblings ...)
  2011-03-08 22:43 ` [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Vivek Goyal
@ 2011-03-09  5:22 ` KAMEZAWA Hiroyuki
  2011-03-09 15:19   ` Vivek Goyal
                     ` (2 more replies)
  7 siblings, 3 replies; 33+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  5:22 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, vgoyal, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue,  8 Mar 2011 13:20:50 -0800
Justin TerAvest <teravest@google.com> wrote:

> This patchset adds tracking to the page_cgroup structure for which cgroup has
> dirtied a page, and uses that information to provide isolation between
> cgroups performing writeback.
> 
> I know that there is some discussion to remove request descriptor limits
> entirely, but I included a patch to introduce per-cgroup limits to enable
> this functionality. Without it, we didn't see much isolation improvement.
> 
> I think most of this material has been discussed on lkml previously, this is
> just another attempt to make a patchset that handles buffered writes for CFQ.
> 
> There was a lot of previous discussion at:
>   http://thread.gmane.org/gmane.linux.kernel/1007922
> 
> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
> and Vivek Goyal for work on previous versions of these patches.
> 

2 points from me.

1. If you know there is a thread, please join even if it's a bit little thread.
2. In these days, I got many comments as 'page_cgroup is toooo big', so
   I don't feel fun when I see a patch which increases size of page_cgroup.

I don't like to increase size of page_cgroup but I think you can record 
information without increasing size of page_cgroup.

A) As Andrea did, encode it to pc->flags.
   But I'm afraid that there is a racy case because memory cgroup uses some
   test_and_set() bits.
B) I wonder why the information cannot be recorded in page->private.
   When page has buffers, you can record the information to buffer struct.
   About swapio (if you take care of), you can record information to bio.

Anyway, thank you for the work. And please join the discussion and explain
"Without it, we didn't see much isolation improvement." with real data.

Regards,
-Kame



> 
>  Documentation/block/biodoc.txt |   10 +
>  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
>  block/blk-cgroup.h             |    9 +-
>  block/blk-core.c               |  216 +++++++++++++++--------
>  block/blk-settings.c           |    2 +-
>  block/blk-sysfs.c              |   60 ++++---
>  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
>  block/cfq.h                    |    6 +-
>  block/elevator.c               |   11 +-
>  fs/buffer.c                    |    2 +
>  fs/direct-io.c                 |    2 +
>  include/linux/blk_types.h      |    2 +
>  include/linux/blkdev.h         |   81 ++++++++-
>  include/linux/blkio-track.h    |   89 +++++++++
>  include/linux/elevator.h       |   14 ++-
>  include/linux/iocontext.h      |    1 +
>  include/linux/memcontrol.h     |    6 +
>  include/linux/mmzone.h         |    4 +-
>  include/linux/page_cgroup.h    |   12 +-
>  init/Kconfig                   |   16 ++
>  mm/Makefile                    |    3 +-
>  mm/bounce.c                    |    2 +
>  mm/filemap.c                   |    2 +
>  mm/memcontrol.c                |    6 +
>  mm/memory.c                    |    6 +
>  mm/page-writeback.c            |   14 ++-
>  mm/page_cgroup.c               |   29 ++-
>  mm/swap_state.c                |    2 +
>  28 files changed, 985 insertions(+), 216 deletions(-)
> 
> [PATCH 1/6] Add IO cgroup tracking for buffered writes.
> [PATCH 2/6] Make async queues per cgroup.
> [PATCH 3/6] Modify CFQ to use IO tracking information.
> [PATCH 4/6] With per-cgroup async, don't special case queues.
> [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
> [PATCH 6/6] Per cgroup request descriptor counts
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-09  5:22 ` KAMEZAWA Hiroyuki
@ 2011-03-09 15:19   ` Vivek Goyal
  2011-03-09 18:05   ` Justin TerAvest
  2011-03-10 18:08   ` Justin TerAvest
  2 siblings, 0 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-03-09 15:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Justin TerAvest, m-ikeda, jaxboe, linux-kernel, ryov, taka,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Wed, Mar 09, 2011 at 02:22:37PM +0900, KAMEZAWA Hiroyuki wrote:
[..]

> B) I wonder why the information cannot be recorded in page->private.
>    When page has buffers, you can record the information to buffer struct.
>    About swapio (if you take care of), you can record information to bio.
> 

This sounds very interesting. I am not a vm guy hence I have no idea in
what situations page->private is used and is it ok, to introduce one
more user of it. But if it works, then getting rid of this dependency
on page_cgroup is a huge plug, IMHO.

> Anyway, thank you for the work. And please join the discussion and explain
> "Without it, we didn't see much isolation improvement." with real data.

Agreed. AFAICS, this piece is not going to help much until and unless
we have upper layers fixed (per cgroup dirty ratio and cgroup aware writeout).
So once those two are fixed and in kernel, it will pave the way for
this piece to go in.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-08 22:50   ` Vivek Goyal
@ 2011-03-09 18:04     ` Justin TerAvest
  2011-03-11  2:47       ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Justin TerAvest @ 2011-03-09 18:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
>> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
>> > This patchset adds tracking to the page_cgroup structure for which cgroup has
>> > dirtied a page, and uses that information to provide isolation between
>> > cgroups performing writeback.
>> >
>>
>> Justin,
>>
>> So if somebody is trying to isolate a workload which does bunch of READS
>> and lots of buffered WRITES, this patchset should help in the sense that
>> all the heavy WRITES can be put into a separate cgroup of low weight?
>>
>> Other application which are primarily doing READS, direct WRITES or little
>> bit of buffered WRITES should still get good latencies if heavy writer
>> is isolated in a separate group?
>>
>> If yes, then this piece standalone can make sense. And once the other
>> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
>> writeout come in, then one will be able to differentiate buffered writes
>> of different groups.
>
> Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
> question would be will it help me enable get better isolation latencies
> of READS agains buffered WRITES?

Ah! Sorry, I left out a patch that disables cross-group preemption.
I'll add that to the patchset and email out v2 soon.

>
> Thanks
> Vivek
>
>>
>> Thanks
>> Vivek
>>
>> > I know that there is some discussion to remove request descriptor limits
>> > entirely, but I included a patch to introduce per-cgroup limits to enable
>> > this functionality. Without it, we didn't see much isolation improvement.
>> >
>> > I think most of this material has been discussed on lkml previously, this is
>> > just another attempt to make a patchset that handles buffered writes for CFQ.
>> >
>> > There was a lot of previous discussion at:
>> >   http://thread.gmane.org/gmane.linux.kernel/1007922
>> >
>> > Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
>> > and Vivek Goyal for work on previous versions of these patches.
>> >
>> >
>> >  Documentation/block/biodoc.txt |   10 +
>> >  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
>> >  block/blk-cgroup.h             |    9 +-
>> >  block/blk-core.c               |  216 +++++++++++++++--------
>> >  block/blk-settings.c           |    2 +-
>> >  block/blk-sysfs.c              |   60 ++++---
>> >  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
>> >  block/cfq.h                    |    6 +-
>> >  block/elevator.c               |   11 +-
>> >  fs/buffer.c                    |    2 +
>> >  fs/direct-io.c                 |    2 +
>> >  include/linux/blk_types.h      |    2 +
>> >  include/linux/blkdev.h         |   81 ++++++++-
>> >  include/linux/blkio-track.h    |   89 +++++++++
>> >  include/linux/elevator.h       |   14 ++-
>> >  include/linux/iocontext.h      |    1 +
>> >  include/linux/memcontrol.h     |    6 +
>> >  include/linux/mmzone.h         |    4 +-
>> >  include/linux/page_cgroup.h    |   12 +-
>> >  init/Kconfig                   |   16 ++
>> >  mm/Makefile                    |    3 +-
>> >  mm/bounce.c                    |    2 +
>> >  mm/filemap.c                   |    2 +
>> >  mm/memcontrol.c                |    6 +
>> >  mm/memory.c                    |    6 +
>> >  mm/page-writeback.c            |   14 ++-
>> >  mm/page_cgroup.c               |   29 ++-
>> >  mm/swap_state.c                |    2 +
>> >  28 files changed, 985 insertions(+), 216 deletions(-)
>> >
>> > [PATCH 1/6] Add IO cgroup tracking for buffered writes.
>> > [PATCH 2/6] Make async queues per cgroup.
>> > [PATCH 3/6] Modify CFQ to use IO tracking information.
>> > [PATCH 4/6] With per-cgroup async, don't special case queues.
>> > [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
>> > [PATCH 6/6] Per cgroup request descriptor counts
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-09  5:22 ` KAMEZAWA Hiroyuki
  2011-03-09 15:19   ` Vivek Goyal
@ 2011-03-09 18:05   ` Justin TerAvest
  2011-03-10 18:08   ` Justin TerAvest
  2 siblings, 0 replies; 33+ messages in thread
From: Justin TerAvest @ 2011-03-09 18:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: m-ikeda, jaxboe, vgoyal, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 8, 2011 at 9:22 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue,  8 Mar 2011 13:20:50 -0800
> Justin TerAvest <teravest@google.com> wrote:
>
>> This patchset adds tracking to the page_cgroup structure for which cgroup has
>> dirtied a page, and uses that information to provide isolation between
>> cgroups performing writeback.
>>
>> I know that there is some discussion to remove request descriptor limits
>> entirely, but I included a patch to introduce per-cgroup limits to enable
>> this functionality. Without it, we didn't see much isolation improvement.
>>
>> I think most of this material has been discussed on lkml previously, this is
>> just another attempt to make a patchset that handles buffered writes for CFQ.
>>
>> There was a lot of previous discussion at:
>>   http://thread.gmane.org/gmane.linux.kernel/1007922
>>
>> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
>> and Vivek Goyal for work on previous versions of these patches.
>>
>
> 2 points from me.
>
> 1. If you know there is a thread, please join even if it's a bit little thread.
> 2. In these days, I got many comments as 'page_cgroup is toooo big', so
>   I don't feel fun when I see a patch which increases size of page_cgroup.
>
> I don't like to increase size of page_cgroup but I think you can record
> information without increasing size of page_cgroup.

That sounds reasonable.

>
> A) As Andrea did, encode it to pc->flags.
>   But I'm afraid that there is a racy case because memory cgroup uses some
>   test_and_set() bits.
> B) I wonder why the information cannot be recorded in page->private.
>   When page has buffers, you can record the information to buffer struct.
>   About swapio (if you take care of), you can record information to bio.
>
> Anyway, thank you for the work. And please join the discussion and explain
> "Without it, we didn't see much isolation improvement." with real data.

Thanks. I will include real data in the cover sheet for v2 of the
patchset, which I'll mail out soon.

Thanks,
Justin

>
> Regards,
> -Kame
>
>
>
>>
>>  Documentation/block/biodoc.txt |   10 +
>>  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
>>  block/blk-cgroup.h             |    9 +-
>>  block/blk-core.c               |  216 +++++++++++++++--------
>>  block/blk-settings.c           |    2 +-
>>  block/blk-sysfs.c              |   60 ++++---
>>  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
>>  block/cfq.h                    |    6 +-
>>  block/elevator.c               |   11 +-
>>  fs/buffer.c                    |    2 +
>>  fs/direct-io.c                 |    2 +
>>  include/linux/blk_types.h      |    2 +
>>  include/linux/blkdev.h         |   81 ++++++++-
>>  include/linux/blkio-track.h    |   89 +++++++++
>>  include/linux/elevator.h       |   14 ++-
>>  include/linux/iocontext.h      |    1 +
>>  include/linux/memcontrol.h     |    6 +
>>  include/linux/mmzone.h         |    4 +-
>>  include/linux/page_cgroup.h    |   12 +-
>>  init/Kconfig                   |   16 ++
>>  mm/Makefile                    |    3 +-
>>  mm/bounce.c                    |    2 +
>>  mm/filemap.c                   |    2 +
>>  mm/memcontrol.c                |    6 +
>>  mm/memory.c                    |    6 +
>>  mm/page-writeback.c            |   14 ++-
>>  mm/page_cgroup.c               |   29 ++-
>>  mm/swap_state.c                |    2 +
>>  28 files changed, 985 insertions(+), 216 deletions(-)
>>
>> [PATCH 1/6] Add IO cgroup tracking for buffered writes.
>> [PATCH 2/6] Make async queues per cgroup.
>> [PATCH 3/6] Modify CFQ to use IO tracking information.
>> [PATCH 4/6] With per-cgroup async, don't special case queues.
>> [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
>> [PATCH 6/6] Per cgroup request descriptor counts
>>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-09  5:22 ` KAMEZAWA Hiroyuki
  2011-03-09 15:19   ` Vivek Goyal
  2011-03-09 18:05   ` Justin TerAvest
@ 2011-03-10 18:08   ` Justin TerAvest
  2011-03-10 18:15     ` Vivek Goyal
  2 siblings, 1 reply; 33+ messages in thread
From: Justin TerAvest @ 2011-03-10 18:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: m-ikeda, jaxboe, vgoyal, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 8, 2011 at 9:22 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue,  8 Mar 2011 13:20:50 -0800
> Justin TerAvest <teravest@google.com> wrote:
>
>> This patchset adds tracking to the page_cgroup structure for which cgroup has
>> dirtied a page, and uses that information to provide isolation between
>> cgroups performing writeback.
>>
>> I know that there is some discussion to remove request descriptor limits
>> entirely, but I included a patch to introduce per-cgroup limits to enable
>> this functionality. Without it, we didn't see much isolation improvement.
>>
>> I think most of this material has been discussed on lkml previously, this is
>> just another attempt to make a patchset that handles buffered writes for CFQ.
>>
>> There was a lot of previous discussion at:
>>   http://thread.gmane.org/gmane.linux.kernel/1007922
>>
>> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
>> and Vivek Goyal for work on previous versions of these patches.
>>
>
> 2 points from me.
>
> 1. If you know there is a thread, please join even if it's a bit little thread.
> 2. In these days, I got many comments as 'page_cgroup is toooo big', so
>   I don't feel fun when I see a patch which increases size of page_cgroup.
>
> I don't like to increase size of page_cgroup but I think you can record
> information without increasing size of page_cgroup.
>
> A) As Andrea did, encode it to pc->flags.
>   But I'm afraid that there is a racy case because memory cgroup uses some
>   test_and_set() bits.
> B) I wonder why the information cannot be recorded in page->private.
>   When page has buffers, you can record the information to buffer struct.
>   About swapio (if you take care of), you can record information to bio.

Hi Kame,

I'm concerned that by using something like buffer_heads stored in
page->private, we will only be supported on some filesystems and not
others. In addition, I'm not sure if all filesystems attach buffer
heads at the same time; if page->private is modified in the flusher
thread, we might not be able to determine the thread that dirtied the
page in the first place.

For now, I think I will encode the data to pc->flags. Let me know if I
misunderstood your suggestion.

Thanks,
Justin

>
> Anyway, thank you for the work. And please join the discussion and explain
> "Without it, we didn't see much isolation improvement." with real data.
>
> Regards,
> -Kame
>
>
>
>>
>>  Documentation/block/biodoc.txt |   10 +
>>  block/blk-cgroup.c             |  204 +++++++++++++++++++++-
>>  block/blk-cgroup.h             |    9 +-
>>  block/blk-core.c               |  216 +++++++++++++++--------
>>  block/blk-settings.c           |    2 +-
>>  block/blk-sysfs.c              |   60 ++++---
>>  block/cfq-iosched.c            |  390 +++++++++++++++++++++++++++++++---------
>>  block/cfq.h                    |    6 +-
>>  block/elevator.c               |   11 +-
>>  fs/buffer.c                    |    2 +
>>  fs/direct-io.c                 |    2 +
>>  include/linux/blk_types.h      |    2 +
>>  include/linux/blkdev.h         |   81 ++++++++-
>>  include/linux/blkio-track.h    |   89 +++++++++
>>  include/linux/elevator.h       |   14 ++-
>>  include/linux/iocontext.h      |    1 +
>>  include/linux/memcontrol.h     |    6 +
>>  include/linux/mmzone.h         |    4 +-
>>  include/linux/page_cgroup.h    |   12 +-
>>  init/Kconfig                   |   16 ++
>>  mm/Makefile                    |    3 +-
>>  mm/bounce.c                    |    2 +
>>  mm/filemap.c                   |    2 +
>>  mm/memcontrol.c                |    6 +
>>  mm/memory.c                    |    6 +
>>  mm/page-writeback.c            |   14 ++-
>>  mm/page_cgroup.c               |   29 ++-
>>  mm/swap_state.c                |    2 +
>>  28 files changed, 985 insertions(+), 216 deletions(-)
>>
>> [PATCH 1/6] Add IO cgroup tracking for buffered writes.
>> [PATCH 2/6] Make async queues per cgroup.
>> [PATCH 3/6] Modify CFQ to use IO tracking information.
>> [PATCH 4/6] With per-cgroup async, don't special case queues.
>> [PATCH 5/6] Add stat for per cgroup writeout done by flusher.
>> [PATCH 6/6] Per cgroup request descriptor counts
>>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-10 18:08   ` Justin TerAvest
@ 2011-03-10 18:15     ` Vivek Goyal
  2011-03-10 18:57       ` Justin TerAvest
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-10 18:15 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:

[..]
> > I don't like to increase size of page_cgroup but I think you can record
> > information without increasing size of page_cgroup.
> >
> > A) As Andrea did, encode it to pc->flags.
> >   But I'm afraid that there is a racy case because memory cgroup uses some
> >   test_and_set() bits.
> > B) I wonder why the information cannot be recorded in page->private.
> >   When page has buffers, you can record the information to buffer struct.
> >   About swapio (if you take care of), you can record information to bio.
> 
> Hi Kame,
> 
> I'm concerned that by using something like buffer_heads stored in
> page->private, we will only be supported on some filesystems and not
> others. In addition, I'm not sure if all filesystems attach buffer
> heads at the same time; if page->private is modified in the flusher
> thread, we might not be able to determine the thread that dirtied the
> page in the first place.

I think the person who dirtied the page can store the information in
page->private (assuming buffer heads were not generated) and if flusher
thread later ends up generating buffer heads and ends up modifying
page->private, this can be copied in buffer heads?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-10 18:15     ` Vivek Goyal
@ 2011-03-10 18:57       ` Justin TerAvest
  2011-03-10 19:11         ` [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.) Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Justin TerAvest @ 2011-03-10 18:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Thu, Mar 10, 2011 at 10:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:
>
> [..]
>> > I don't like to increase size of page_cgroup but I think you can record
>> > information without increasing size of page_cgroup.
>> >
>> > A) As Andrea did, encode it to pc->flags.
>> >   But I'm afraid that there is a racy case because memory cgroup uses some
>> >   test_and_set() bits.
>> > B) I wonder why the information cannot be recorded in page->private.
>> >   When page has buffers, you can record the information to buffer struct.
>> >   About swapio (if you take care of), you can record information to bio.
>>
>> Hi Kame,
>>
>> I'm concerned that by using something like buffer_heads stored in
>> page->private, we will only be supported on some filesystems and not
>> others. In addition, I'm not sure if all filesystems attach buffer
>> heads at the same time; if page->private is modified in the flusher
>> thread, we might not be able to determine the thread that dirtied the
>> page in the first place.
>
> I think the person who dirtied the page can store the information in
> page->private (assuming buffer heads were not generated) and if flusher
> thread later ends up generating buffer heads and ends up modifying
> page->private, this can be copied in buffer heads?

This scares me a bit.

As I understand it, fs/ code expects total ownership of page->private.
This adds a responsibility for every user to copy the data through and
store it in the buffer head (or anything else). btrfs seems to do
something entirely different in some cases and store a different kind
of value.

I don't know that it's right to add the burden to copy the original
value to everything that wants to use page->private.

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 18:57       ` Justin TerAvest
@ 2011-03-10 19:11         ` Vivek Goyal
  2011-03-10 19:41           ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-10 19:11 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin,
	linux-fsdevel, Chris Mason

On Thu, Mar 10, 2011 at 10:57:52AM -0800, Justin TerAvest wrote:
> On Thu, Mar 10, 2011 at 10:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:
> >
> > [..]
> >> > I don't like to increase size of page_cgroup but I think you can record
> >> > information without increasing size of page_cgroup.
> >> >
> >> > A) As Andrea did, encode it to pc->flags.
> >> >   But I'm afraid that there is a racy case because memory cgroup uses some
> >> >   test_and_set() bits.
> >> > B) I wonder why the information cannot be recorded in page->private.
> >> >   When page has buffers, you can record the information to buffer struct.
> >> >   About swapio (if you take care of), you can record information to bio.
> >>
> >> Hi Kame,
> >>
> >> I'm concerned that by using something like buffer_heads stored in
> >> page->private, we will only be supported on some filesystems and not
> >> others. In addition, I'm not sure if all filesystems attach buffer
> >> heads at the same time; if page->private is modified in the flusher
> >> thread, we might not be able to determine the thread that dirtied the
> >> page in the first place.
> >
> > I think the person who dirtied the page can store the information in
> > page->private (assuming buffer heads were not generated) and if flusher
> > thread later ends up generating buffer heads and ends up modifying
> > page->private, this can be copied in buffer heads?
> 
> This scares me a bit.
> 
> As I understand it, fs/ code expects total ownership of page->private.
> This adds a responsibility for every user to copy the data through and
> store it in the buffer head (or anything else). btrfs seems to do
> something entirely different in some cases and store a different kind
> of value.

If filesystems are using page->private for some other purpose also, then
I guess we have issues. 

I am ccing linux-fsdevel to have some feedback on the idea of trying
to store cgroup id of page dirtying thread in page->private and/or buffer
head for tracking which group originally dirtied the page in IO controller
during writeback.

> 
> I don't know that it's right to add the burden to copy the original
> value to everything that wants to use page->private.
> 

How many such places are there?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 19:11         ` [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.) Vivek Goyal
@ 2011-03-10 19:41           ` Vivek Goyal
  2011-03-10 21:15             ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-10 19:41 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin,
	linux-fsdevel, Chris Mason

On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> On Thu, Mar 10, 2011 at 10:57:52AM -0800, Justin TerAvest wrote:
> > On Thu, Mar 10, 2011 at 10:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:
> > >
> > > [..]
> > >> > I don't like to increase size of page_cgroup but I think you can record
> > >> > information without increasing size of page_cgroup.
> > >> >
> > >> > A) As Andrea did, encode it to pc->flags.
> > >> >   But I'm afraid that there is a racy case because memory cgroup uses some
> > >> >   test_and_set() bits.
> > >> > B) I wonder why the information cannot be recorded in page->private.
> > >> >   When page has buffers, you can record the information to buffer struct.
> > >> >   About swapio (if you take care of), you can record information to bio.
> > >>
> > >> Hi Kame,
> > >>
> > >> I'm concerned that by using something like buffer_heads stored in
> > >> page->private, we will only be supported on some filesystems and not
> > >> others. In addition, I'm not sure if all filesystems attach buffer
> > >> heads at the same time; if page->private is modified in the flusher
> > >> thread, we might not be able to determine the thread that dirtied the
> > >> page in the first place.
> > >
> > > I think the person who dirtied the page can store the information in
> > > page->private (assuming buffer heads were not generated) and if flusher
> > > thread later ends up generating buffer heads and ends up modifying
> > > page->private, this can be copied in buffer heads?
> > 
> > This scares me a bit.
> > 
> > As I understand it, fs/ code expects total ownership of page->private.
> > This adds a responsibility for every user to copy the data through and
> > store it in the buffer head (or anything else). btrfs seems to do
> > something entirely different in some cases and store a different kind
> > of value.
> 
> If filesystems are using page->private for some other purpose also, then
> I guess we have issues. 
> 
> I am ccing linux-fsdevel to have some feedback on the idea of trying
> to store cgroup id of page dirtying thread in page->private and/or buffer
> head for tracking which group originally dirtied the page in IO controller
> during writeback.

A quick "grep" showed that btrfs, ceph and logfs are using page->private
for other purposes also.

I was under the impression that either page->private is null or it 
points to buffer heads for the writeback case. So storing the info
directly in either buffer head directly or first in page->private and
then transferring it to buffer heads would have helped. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 19:41           ` Vivek Goyal
@ 2011-03-10 21:15             ` Chris Mason
  2011-03-10 21:24               ` Andreas Dilger
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Mason @ 2011-03-10 21:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
	linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
	ctalbott, nauman, mrubin, linux-fsdevel

Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> > On Thu, Mar 10, 2011 at 10:57:52AM -0800, Justin TerAvest wrote:
> > > On Thu, Mar 10, 2011 at 10:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:
> > > >
> > > > [..]
> > > >> > I don't like to increase size of page_cgroup but I think you can record
> > > >> > information without increasing size of page_cgroup.
> > > >> >
> > > >> > A) As Andrea did, encode it to pc->flags.
> > > >> >   But I'm afraid that there is a racy case because memory cgroup uses some
> > > >> >   test_and_set() bits.
> > > >> > B) I wonder why the information cannot be recorded in page->private.
> > > >> >   When page has buffers, you can record the information to buffer struct.
> > > >> >   About swapio (if you take care of), you can record information to bio.
> > > >>
> > > >> Hi Kame,
> > > >>
> > > >> I'm concerned that by using something like buffer_heads stored in
> > > >> page->private, we will only be supported on some filesystems and not
> > > >> others. In addition, I'm not sure if all filesystems attach buffer
> > > >> heads at the same time; if page->private is modified in the flusher
> > > >> thread, we might not be able to determine the thread that dirtied the
> > > >> page in the first place.
> > > >
> > > > I think the person who dirtied the page can store the information in
> > > > page->private (assuming buffer heads were not generated) and if flusher
> > > > thread later ends up generating buffer heads and ends up modifying
> > > > page->private, this can be copied in buffer heads?
> > > 
> > > This scares me a bit.
> > > 
> > > As I understand it, fs/ code expects total ownership of page->private.
> > > This adds a responsibility for every user to copy the data through and
> > > store it in the buffer head (or anything else). btrfs seems to do
> > > something entirely different in some cases and store a different kind
> > > of value.
> > 
> > If filesystems are using page->private for some other purpose also, then
> > I guess we have issues. 
> > 
> > I am ccing linux-fsdevel to have some feedback on the idea of trying
> > to store cgroup id of page dirtying thread in page->private and/or buffer
> > head for tracking which group originally dirtied the page in IO controller
> > during writeback.
> 
> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> for other purposes also.
> 
> I was under the impression that either page->private is null or it 
> points to buffer heads for the writeback case. So storing the info
> directly in either buffer head directly or first in page->private and
> then transferring it to buffer heads would have helped. 

Right, btrfs has its own uses for page->private, and we expect to own
it.  With a proper callback, the FS could store the extra information you
need in out own structs.

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 21:15             ` Chris Mason
@ 2011-03-10 21:24               ` Andreas Dilger
  2011-03-10 21:38                 ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Andreas Dilger @ 2011-03-10 21:24 UTC (permalink / raw)
  To: Chris Mason
  Cc: Vivek Goyal, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
	linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
	ctalbott, nauman, mrubin, linux-fsdevel

On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
>> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
>>>>> I think the person who dirtied the page can store the information in
>>>>> page->private (assuming buffer heads were not generated) and if flusher
>>>>> thread later ends up generating buffer heads and ends up modifying
>>>>> page->private, this can be copied in buffer heads?
>>>> 
>>>> This scares me a bit.
>>>> 
>>>> As I understand it, fs/ code expects total ownership of page->private.
>>>> This adds a responsibility for every user to copy the data through and
>>>> store it in the buffer head (or anything else). btrfs seems to do
>>>> something entirely different in some cases and store a different kind
>>>> of value.
>>> 
>>> If filesystems are using page->private for some other purpose also, then
>>> I guess we have issues. 
>>> 
>>> I am ccing linux-fsdevel to have some feedback on the idea of trying
>>> to store cgroup id of page dirtying thread in page->private and/or buffer
>>> head for tracking which group originally dirtied the page in IO controller
>>> during writeback.
>> 
>> A quick "grep" showed that btrfs, ceph and logfs are using page->private
>> for other purposes also.
>> 
>> I was under the impression that either page->private is null or it 
>> points to buffer heads for the writeback case. So storing the info
>> directly in either buffer head directly or first in page->private and
>> then transferring it to buffer heads would have helped. 
> 
> Right, btrfs has its own uses for page->private, and we expect to own
> it.  With a proper callback, the FS could store the extra information you
> need in out own structs.

There is no requirement that page->private ever points to a buffer_head, and Lustre clients use it for its own tracking structure (never touching buffer_heads at all).  Any assumption about what a filesystem is storing in page->private in other parts of the code is just broken.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 21:24               ` Andreas Dilger
@ 2011-03-10 21:38                 ` Vivek Goyal
  2011-03-10 21:43                   ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-10 21:38 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Chris Mason, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
	linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
	ctalbott, nauman, mrubin, linux-fsdevel

On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> > Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> >> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> >>>>> I think the person who dirtied the page can store the information in
> >>>>> page->private (assuming buffer heads were not generated) and if flusher
> >>>>> thread later ends up generating buffer heads and ends up modifying
> >>>>> page->private, this can be copied in buffer heads?
> >>>> 
> >>>> This scares me a bit.
> >>>> 
> >>>> As I understand it, fs/ code expects total ownership of page->private.
> >>>> This adds a responsibility for every user to copy the data through and
> >>>> store it in the buffer head (or anything else). btrfs seems to do
> >>>> something entirely different in some cases and store a different kind
> >>>> of value.
> >>> 
> >>> If filesystems are using page->private for some other purpose also, then
> >>> I guess we have issues. 
> >>> 
> >>> I am ccing linux-fsdevel to have some feedback on the idea of trying
> >>> to store cgroup id of page dirtying thread in page->private and/or buffer
> >>> head for tracking which group originally dirtied the page in IO controller
> >>> during writeback.
> >> 
> >> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> >> for other purposes also.
> >> 
> >> I was under the impression that either page->private is null or it 
> >> points to buffer heads for the writeback case. So storing the info
> >> directly in either buffer head directly or first in page->private and
> >> then transferring it to buffer heads would have helped. 
> > 
> > Right, btrfs has its own uses for page->private, and we expect to own
> > it.  With a proper callback, the FS could store the extra information you
> > need in out own structs.
> 
> There is no requirement that page->private ever points to a buffer_head, and Lustre clients use it for its own tracking structure (never touching buffer_heads at all).  Any assumption about what a filesystem is storing in page->private in other parts of the code is just broken.

Andreas,

As Chris mentioned, will providing callbacks so that filesystem can
save/restore page->private be reasonable?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 21:38                 ` Vivek Goyal
@ 2011-03-10 21:43                   ` Chris Mason
  2011-03-11  1:20                     ` KAMEZAWA Hiroyuki
  2011-03-11  1:46                     ` Dave Chinner
  0 siblings, 2 replies; 33+ messages in thread
From: Chris Mason @ 2011-03-10 21:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda,
	jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng,
	balbir, ctalbott, nauman, mrubin, linux-fsdevel

Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
> On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> > On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> > > Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> > >> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> > >>>>> I think the person who dirtied the page can store the information in
> > >>>>> page->private (assuming buffer heads were not generated) and if flusher
> > >>>>> thread later ends up generating buffer heads and ends up modifying
> > >>>>> page->private, this can be copied in buffer heads?
> > >>>> 
> > >>>> This scares me a bit.
> > >>>> 
> > >>>> As I understand it, fs/ code expects total ownership of page->private.
> > >>>> This adds a responsibility for every user to copy the data through and
> > >>>> store it in the buffer head (or anything else). btrfs seems to do
> > >>>> something entirely different in some cases and store a different kind
> > >>>> of value.
> > >>> 
> > >>> If filesystems are using page->private for some other purpose also, then
> > >>> I guess we have issues. 
> > >>> 
> > >>> I am ccing linux-fsdevel to have some feedback on the idea of trying
> > >>> to store cgroup id of page dirtying thread in page->private and/or buffer
> > >>> head for tracking which group originally dirtied the page in IO controller
> > >>> during writeback.
> > >> 
> > >> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> > >> for other purposes also.
> > >> 
> > >> I was under the impression that either page->private is null or it 
> > >> points to buffer heads for the writeback case. So storing the info
> > >> directly in either buffer head directly or first in page->private and
> > >> then transferring it to buffer heads would have helped. 
> > > 
> > > Right, btrfs has its own uses for page->private, and we expect to own
> > > it.  With a proper callback, the FS could store the extra information you
> > > need in out own structs.
> > 
> > There is no requirement that page->private ever points to a buffer_head, and Lustre clients use it for its own tracking structure (never touching buffer_heads at all).  Any assumption about what a filesystem is storing in page->private in other parts of the code is just broken.
> 
> Andreas,
> 
> As Chris mentioned, will providing callbacks so that filesystem can
> save/restore page->private be reasonable?

Just to clarify, I think saving/restoring page->private is going to be
hard.  I'd rather just have a call back that says here's a page, storage
this for the block io controller please, and another one that returns
any previously stored info.

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 21:43                   ` Chris Mason
@ 2011-03-11  1:20                     ` KAMEZAWA Hiroyuki
  2011-03-11  1:46                     ` Dave Chinner
  1 sibling, 0 replies; 33+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-11  1:20 UTC (permalink / raw)
  To: Chris Mason
  Cc: Vivek Goyal, Andreas Dilger, Justin TerAvest, m-ikeda, jaxboe,
	linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
	ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 16:43:31 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
> > On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> > > On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> > > > Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> > > >> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> > > >>>>> I think the person who dirtied the page can store the information in
> > > >>>>> page->private (assuming buffer heads were not generated) and if flusher
> > > >>>>> thread later ends up generating buffer heads and ends up modifying
> > > >>>>> page->private, this can be copied in buffer heads?
> > > >>>> 
> > > >>>> This scares me a bit.
> > > >>>> 
> > > >>>> As I understand it, fs/ code expects total ownership of page->private.
> > > >>>> This adds a responsibility for every user to copy the data through and
> > > >>>> store it in the buffer head (or anything else). btrfs seems to do
> > > >>>> something entirely different in some cases and store a different kind
> > > >>>> of value.
> > > >>> 
> > > >>> If filesystems are using page->private for some other purpose also, then
> > > >>> I guess we have issues. 
> > > >>> 
> > > >>> I am ccing linux-fsdevel to have some feedback on the idea of trying
> > > >>> to store cgroup id of page dirtying thread in page->private and/or buffer
> > > >>> head for tracking which group originally dirtied the page in IO controller
> > > >>> during writeback.
> > > >> 
> > > >> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> > > >> for other purposes also.
> > > >> 
> > > >> I was under the impression that either page->private is null or it 
> > > >> points to buffer heads for the writeback case. So storing the info
> > > >> directly in either buffer head directly or first in page->private and
> > > >> then transferring it to buffer heads would have helped. 
> > > > 
> > > > Right, btrfs has its own uses for page->private, and we expect to own
> > > > it.  With a proper callback, the FS could store the extra information you
> > > > need in out own structs.
> > > 
> > > There is no requirement that page->private ever points to a buffer_head, and Lustre clients use it for its own tracking structure (never touching buffer_heads at all).  Any assumption about what a filesystem is storing in page->private in other parts of the code is just broken.
> > 
> > Andreas,
> > 
> > As Chris mentioned, will providing callbacks so that filesystem can
> > save/restore page->private be reasonable?
> 
> Just to clarify, I think saving/restoring page->private is going to be
> hard.  I'd rather just have a call back that says here's a page, storage
> this for the block io controller please, and another one that returns
> any previously stored info.
> 

Hmm, Vivek,
for dynamic allocation of io-record, how about this kind of tagging ?
(just an idea. not compiled at all.)

Pros.
- much better than consuming 2bytes for all pages including pages other 
  than file caches. 
- this will allow lockless lookup of iotag.
- setting iotag can be done at the same time PAGECACHE_TAG_DIRTY...
  no extra lock will be required.
- At clearing, we can expect lock for radix-tree is already held.
Cons.
- makes radix-tree struct larger and not good for cacheline.
- some special care? will be required at page-migration.


==
@@ -51,6 +51,9 @@ struct radix_tree_node {
        struct rcu_head rcu_head;
        void __rcu      *slots[RADIX_TREE_MAP_SIZE];
        unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_BLK_CGROUP
+       unsigned short  iotag[RADIX_TREE_MAP_SIZE];
+#endif
 };

 struct radix_tree_path {
@@ -487,6 +490,36 @@ void *radix_tree_tag_set(struct radix_tr
 }
 EXPORT_SYMBOL(radix_tree_tag_set);

+#ifdef CONFIG_BLK_CGROUP
+void *radix_tree_iotag_set(struct radix_tree_root *root,
+                       unsigned long index, unsigned short tag)
+{
+       unsigned int height, shift;
+       struct radix_tree_node *node;
+
+       height = root->height;
+       BUG_ON(index > radix_tree_maxindex(height));
+
+       node = indirect_to_ptr(root->rnode);
+       shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+
+       while (height > 0) {
+               int offset;
+
+               offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+               node = node->slots[offset];
+               BUG(!node);
+               shift -= RADIX_TREE_MAP_SHIFT;
+               height--;
+       }
+       node->iotag[offset] = tag;
+
+       return;
+}
+EXPORT_SYMBOL(radix_tree_iotag_set);
+


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-10 21:43                   ` Chris Mason
  2011-03-11  1:20                     ` KAMEZAWA Hiroyuki
@ 2011-03-11  1:46                     ` Dave Chinner
  2011-03-11  2:15                       ` Vivek Goyal
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2011-03-11  1:46 UTC (permalink / raw)
  To: Chris Mason
  Cc: Vivek Goyal, Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki,
	m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Thu, Mar 10, 2011 at 04:43:31PM -0500, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
> > On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> > > On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> > > > Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> > > >> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> > > >>>>> I think the person who dirtied the page can store the information in
> > > >>>>> page->private (assuming buffer heads were not generated) and if flusher
> > > >>>>> thread later ends up generating buffer heads and ends up modifying
> > > >>>>> page->private, this can be copied in buffer heads?
> > > >>>> 
> > > >>>> This scares me a bit.
> > > >>>> 
> > > >>>> As I understand it, fs/ code expects total ownership of page->private.
> > > >>>> This adds a responsibility for every user to copy the data through and
> > > >>>> store it in the buffer head (or anything else). btrfs seems to do
> > > >>>> something entirely different in some cases and store a different kind
> > > >>>> of value.
> > > >>> 
> > > >>> If filesystems are using page->private for some other purpose also, then
> > > >>> I guess we have issues. 
> > > >>> 
> > > >>> I am ccing linux-fsdevel to have some feedback on the idea of trying
> > > >>> to store cgroup id of page dirtying thread in page->private and/or buffer
> > > >>> head for tracking which group originally dirtied the page in IO controller
> > > >>> during writeback.
> > > >> 
> > > >> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> > > >> for other purposes also.
> > > >> 
> > > >> I was under the impression that either page->private is null or it 
> > > >> points to buffer heads for the writeback case. So storing the info
> > > >> directly in either buffer head directly or first in page->private and
> > > >> then transferring it to buffer heads would have helped. 
> > > > 
> > > > Right, btrfs has its own uses for page->private, and we expect to own
> > > > it.  With a proper callback, the FS could store the extra information you
> > > > need in out own structs.
> > > 
> > > There is no requirement that page->private ever points to a
> > > buffer_head, and Lustre clients use it for its own tracking
> > > structure (never touching buffer_heads at all).  Any
> > > assumption about what a filesystem is storing in page->private
> > > in other parts of the code is just broken.
> > 
> > Andreas,
> > 
> > As Chris mentioned, will providing callbacks so that filesystem
> > can save/restore page->private be reasonable?
> 
> Just to clarify, I think saving/restoring page->private is going
> to be hard.  I'd rather just have a call back that says here's a
> page, storage this for the block io controller please, and another
> one that returns any previously stored info.

Agreed - there is absolutely no guarantee that some other thread
doesn't grab the page while it is under writeback and dereference
page->private expecting there to be buffer heads or some filesystem
specific structure to be there. Hence swapping out the expected
structure with something different is problematic.

However, I think there's bigger issues. e.g.  page->private might
point to multiple bufferheads that map to non-contiguous disk blocks
that were written by different threads - what happens if we get two
concurrent IOs to the one page, perhaps with different cgroup IDs?

Further, page->private might not even point to a per-page specific
structure - it might point to a structure shared by multiple pages
(e.g. an extent map). Adding a callback like this requires
filesystems to be able to store per-page or per-block information
for external users. Indeed, one of the areas of development in XFS
right now is to move away from storing internal per-block/per-page
information because of the memory overhead it causes.

IMO, if you really need some per-page information, then just put it
in the struct page - you can't hide the memory overhead just by
having the filesystem to store it for you. That just adds
unnecessary complexity...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-11  1:46                     ` Dave Chinner
@ 2011-03-11  2:15                       ` Vivek Goyal
  2011-03-11  2:52                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-11  2:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki,
	m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Fri, Mar 11, 2011 at 12:46:18PM +1100, Dave Chinner wrote:
> On Thu, Mar 10, 2011 at 04:43:31PM -0500, Chris Mason wrote:
> > Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
> > > On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> > > > On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> > > > > Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> > > > >> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> > > > >>>>> I think the person who dirtied the page can store the information in
> > > > >>>>> page->private (assuming buffer heads were not generated) and if flusher
> > > > >>>>> thread later ends up generating buffer heads and ends up modifying
> > > > >>>>> page->private, this can be copied in buffer heads?
> > > > >>>> 
> > > > >>>> This scares me a bit.
> > > > >>>> 
> > > > >>>> As I understand it, fs/ code expects total ownership of page->private.
> > > > >>>> This adds a responsibility for every user to copy the data through and
> > > > >>>> store it in the buffer head (or anything else). btrfs seems to do
> > > > >>>> something entirely different in some cases and store a different kind
> > > > >>>> of value.
> > > > >>> 
> > > > >>> If filesystems are using page->private for some other purpose also, then
> > > > >>> I guess we have issues. 
> > > > >>> 
> > > > >>> I am ccing linux-fsdevel to have some feedback on the idea of trying
> > > > >>> to store cgroup id of page dirtying thread in page->private and/or buffer
> > > > >>> head for tracking which group originally dirtied the page in IO controller
> > > > >>> during writeback.
> > > > >> 
> > > > >> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> > > > >> for other purposes also.
> > > > >> 
> > > > >> I was under the impression that either page->private is null or it 
> > > > >> points to buffer heads for the writeback case. So storing the info
> > > > >> directly in either buffer head directly or first in page->private and
> > > > >> then transferring it to buffer heads would have helped. 
> > > > > 
> > > > > Right, btrfs has its own uses for page->private, and we expect to own
> > > > > it.  With a proper callback, the FS could store the extra information you
> > > > > need in out own structs.
> > > > 
> > > > There is no requirement that page->private ever points to a
> > > > buffer_head, and Lustre clients use it for its own tracking
> > > > structure (never touching buffer_heads at all).  Any
> > > > assumption about what a filesystem is storing in page->private
> > > > in other parts of the code is just broken.
> > > 
> > > Andreas,
> > > 
> > > As Chris mentioned, will providing callbacks so that filesystem
> > > can save/restore page->private be reasonable?
> > 
> > Just to clarify, I think saving/restoring page->private is going
> > to be hard.  I'd rather just have a call back that says here's a
> > page, storage this for the block io controller please, and another
> > one that returns any previously stored info.
> 
> Agreed - there is absolutely no guarantee that some other thread
> doesn't grab the page while it is under writeback and dereference
> page->private expecting there to be buffer heads or some filesystem
> specific structure to be there. Hence swapping out the expected
> structure with something different is problematic.
> 
> However, I think there's bigger issues. e.g.  page->private might
> point to multiple bufferheads that map to non-contiguous disk blocks
> that were written by different threads - what happens if we get two
> concurrent IOs to the one page, perhaps with different cgroup IDs?

I guess in such cases we can afford to lose some accuracy and a 
simple approximation can be the last writer's cgroup id is used
for whole page.
 
> 
> Further, page->private might not even point to a per-page specific
> structure - it might point to a structure shared by multiple pages
> (e.g. an extent map). Adding a callback like this requires
> filesystems to be able to store per-page or per-block information
> for external users. Indeed, one of the areas of development in XFS
> right now is to move away from storing internal per-block/per-page
> information because of the memory overhead it causes.

Ok, if filesystem is trying to move away from per page information then
these kind of callbacks become a burden.

> 
> IMO, if you really need some per-page information, then just put it
> in the struct page - you can't hide the memory overhead just by
> having the filesystem to store it for you. That just adds
> unnecessary complexity...

Ok. I guess adding anything to struct page is going to be hard and 
we might have to fall back to looking into using page_cgroup for
tracking page state. I was trying to explore the ways so that we don't
have to instantiate whole page_cgroup structure just for trying
to figure out who dirtied the page.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-09 18:04     ` Justin TerAvest
@ 2011-03-11  2:47       ` Vivek Goyal
  2011-03-11 16:07         ` Justin TerAvest
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-11  2:47 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Wed, Mar 09, 2011 at 10:04:11AM -0800, Justin TerAvest wrote:
> On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
> >> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
> >> > This patchset adds tracking to the page_cgroup structure for which cgroup has
> >> > dirtied a page, and uses that information to provide isolation between
> >> > cgroups performing writeback.
> >> >
> >>
> >> Justin,
> >>
> >> So if somebody is trying to isolate a workload which does bunch of READS
> >> and lots of buffered WRITES, this patchset should help in the sense that
> >> all the heavy WRITES can be put into a separate cgroup of low weight?
> >>
> >> Other application which are primarily doing READS, direct WRITES or little
> >> bit of buffered WRITES should still get good latencies if heavy writer
> >> is isolated in a separate group?
> >>
> >> If yes, then this piece standalone can make sense. And once the other
> >> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
> >> writeout come in, then one will be able to differentiate buffered writes
> >> of different groups.
> >
> > Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
> > question would be will it help me enable get better isolation latencies
> > of READS agains buffered WRITES?
> 
> Ah! Sorry, I left out a patch that disables cross-group preemption.
> I'll add that to the patchset and email out v2 soon.

Well, what I was referring to that even in current code sync preempts
all async in CFQ. So it looks like this patchset will not help get
better latencies in presence of WRITES?

The only place it can help is that one is looking for service differentation
between two or more buffered write streams. For that we need to fix
upper layers first.

Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-11  2:15                       ` Vivek Goyal
@ 2011-03-11  2:52                         ` KAMEZAWA Hiroyuki
  2011-03-11  3:15                           ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-11  2:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest,
	m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 21:15:31 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:
 
> > IMO, if you really need some per-page information, then just put it
> > in the struct page - you can't hide the memory overhead just by
> > having the filesystem to store it for you. That just adds
> > unnecessary complexity...
> 
> Ok. I guess adding anything to struct page is going to be hard and 
> we might have to fall back to looking into using page_cgroup for
> tracking page state. I was trying to explore the ways so that we don't
> have to instantiate whole page_cgroup structure just for trying
> to figure out who dirtied the page.
> 

Is this bad ?
==

At handling ASYNC I/O in blkio layer, it's unknown that who dirtied the page.
This lack of information makes impossible to throttole Async I/O per
cgroup in blkio queue layer.

This patch records the information into radix-tree and preserve the
information. There is no 'clear' operation because all I/O starts when
the page is marked as DIRTY.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/radix-tree.h |    3 +++
 lib/radix-tree.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

Index: mmotm-Mar10/include/linux/radix-tree.h
===================================================================
--- mmotm-Mar10.orig/include/linux/radix-tree.h
+++ mmotm-Mar10/include/linux/radix-tree.h
@@ -58,12 +58,14 @@ struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	__rcu *rnode;
+	int			iohint;
 };
 
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
 	.rnode = NULL,							\
+	.iohint = 0,							\
 }
 
 #define RADIX_TREE(name, mask) \
@@ -74,6 +76,7 @@ do {									\
 	(root)->height = 0;						\
 	(root)->gfp_mask = (mask);					\
 	(root)->rnode = NULL;						\
+	(root)->iohint = 0;						\
 } while (0)
 
 /**
Index: mmotm-Mar10/lib/radix-tree.c
===================================================================
--- mmotm-Mar10.orig/lib/radix-tree.c
+++ mmotm-Mar10/lib/radix-tree.c
@@ -31,6 +31,7 @@
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
+#include <linux/blkdev.h>
 
 
 #ifdef __KERNEL__
@@ -51,6 +52,9 @@ struct radix_tree_node {
 	struct rcu_head	rcu_head;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_BLK_CGROUP
+	unsigned short  iohint[RADIX_TREE_MAP_SIZE];
+#endif
 };
 
 struct radix_tree_path {
@@ -473,6 +477,8 @@ void *radix_tree_tag_set(struct radix_tr
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
+		if (height == 1 && slot && tag == PAGECACHE_TAG_DIRTY)
+			blkio_record_hint(&slot->iohint[offset]);
 		slot = slot->slots[offset];
 		BUG_ON(slot == NULL);
 		shift -= RADIX_TREE_MAP_SHIFT;
@@ -1418,3 +1425,38 @@ void __init radix_tree_init(void)
 	radix_tree_init_maxindex();
 	hotcpu_notifier(radix_tree_callback, 0);
 }
+
+#ifdef CONFIG_BLK_CGROUP
+
+unsigned short radix_tree_lookup_iohint(struct radix_tree_root *root,
+				int index)
+{
+	unsigned int height, shift;
+	struct radix_tree_node *node;
+
+	node = rcu_redereference(root->rnode);
+	if (node == NULL)
+		return 0;
+	if (!radix_tree_is_indirect_ptr(node))
+		return root->iohint;
+	node = radxi_tree_indirect_to_ptr(node);
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return 0;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	for ( ; ; ) {
+		int offset;
+
+		if (node == NULL)
+			return 0;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (height == 1)
+			return node->iohint[offset];
+		node = rcu_rereference(node->slots[offset]);
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	}
+}
+EXPORT_SYMBOL(radix_tree_lookup_iohint);
+#endif


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-11  3:15                           ` Vivek Goyal
@ 2011-03-11  3:13                             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 33+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-11  3:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest,
	m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 22:15:04 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Fri, Mar 11, 2011 at 11:52:35AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 10 Mar 2011 21:15:31 -0500
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >  
> > > > IMO, if you really need some per-page information, then just put it
> > > > in the struct page - you can't hide the memory overhead just by
> > > > having the filesystem to store it for you. That just adds
> > > > unnecessary complexity...
> > > 
> > > Ok. I guess adding anything to struct page is going to be hard and 
> > > we might have to fall back to looking into using page_cgroup for
> > > tracking page state. I was trying to explore the ways so that we don't
> > > have to instantiate whole page_cgroup structure just for trying
> > > to figure out who dirtied the page.
> > > 
> > 
> > Is this bad ?
> > ==
> 
> Sounds like an interesting idea. I am primarily concered about the radix
> tree node size increase. Not sure how big a concern this is.
> 
> Also tracking is useful for two things.
> 
> - Proportinal IO
> - IO throttling
> 
> For proportional IO, anyway we have to use it with memory controller to
> control per cgroup dirty ratio so storing info in page_cgroup should
> not hurt. 
> 

dirty-ratio for memcg will be implemented. It's definitely necessary.

> The only other case where dependence on page_cgroup hurts is IO throttling
> where IO controller does not really need memory cgroup controller (I hope
> so). But we are still not sure if throttling IO at device level is a
> good idea and how to resolve issues related to priority inversion.
> 
Yes, that's priority inversion is my concern.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)
  2011-03-11  2:52                         ` KAMEZAWA Hiroyuki
@ 2011-03-11  3:15                           ` Vivek Goyal
  2011-03-11  3:13                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-11  3:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest,
	m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea,
	guijianfeng, balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Fri, Mar 11, 2011 at 11:52:35AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 10 Mar 2011 21:15:31 -0500
> Vivek Goyal <vgoyal@redhat.com> wrote:
>  
> > > IMO, if you really need some per-page information, then just put it
> > > in the struct page - you can't hide the memory overhead just by
> > > having the filesystem to store it for you. That just adds
> > > unnecessary complexity...
> > 
> > Ok. I guess adding anything to struct page is going to be hard and 
> > we might have to fall back to looking into using page_cgroup for
> > tracking page state. I was trying to explore the ways so that we don't
> > have to instantiate whole page_cgroup structure just for trying
> > to figure out who dirtied the page.
> > 
> 
> Is this bad ?
> ==

Sounds like an interesting idea. I am primarily concered about the radix
tree node size increase. Not sure how big a concern this is.

Also tracking is useful for two things.

- Proportinal IO
- IO throttling

For proportional IO, anyway we have to use it with memory controller to
control per cgroup dirty ratio so storing info in page_cgroup should
not hurt. 

The only other case where dependence on page_cgroup hurts is IO throttling
where IO controller does not really need memory cgroup controller (I hope
so). But we are still not sure if throttling IO at device level is a
good idea and how to resolve issues related to priority inversion.

But this definitely sounds better than adding a new field in page
struct as I am assuming that it overall is going to consume less memory.

Thanks
Vivek 

> 
> At handling ASYNC I/O in blkio layer, it's unknown that who dirtied the page.
> This lack of information makes impossible to throttole Async I/O per
> cgroup in blkio queue layer.
> 
> This patch records the information into radix-tree and preserve the
> information. There is no 'clear' operation because all I/O starts when
> the page is marked as DIRTY.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/radix-tree.h |    3 +++
>  lib/radix-tree.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
> 
> Index: mmotm-Mar10/include/linux/radix-tree.h
> ===================================================================
> --- mmotm-Mar10.orig/include/linux/radix-tree.h
> +++ mmotm-Mar10/include/linux/radix-tree.h
> @@ -58,12 +58,14 @@ struct radix_tree_root {
>  	unsigned int		height;
>  	gfp_t			gfp_mask;
>  	struct radix_tree_node	__rcu *rnode;
> +	int			iohint;
>  };
>  
>  #define RADIX_TREE_INIT(mask)	{					\
>  	.height = 0,							\
>  	.gfp_mask = (mask),						\
>  	.rnode = NULL,							\
> +	.iohint = 0,							\
>  }
>  
>  #define RADIX_TREE(name, mask) \
> @@ -74,6 +76,7 @@ do {									\
>  	(root)->height = 0;						\
>  	(root)->gfp_mask = (mask);					\
>  	(root)->rnode = NULL;						\
> +	(root)->iohint = 0;						\
>  } while (0)
>  
>  /**
> Index: mmotm-Mar10/lib/radix-tree.c
> ===================================================================
> --- mmotm-Mar10.orig/lib/radix-tree.c
> +++ mmotm-Mar10/lib/radix-tree.c
> @@ -31,6 +31,7 @@
>  #include <linux/string.h>
>  #include <linux/bitops.h>
>  #include <linux/rcupdate.h>
> +#include <linux/blkdev.h>
>  
>  
>  #ifdef __KERNEL__
> @@ -51,6 +52,9 @@ struct radix_tree_node {
>  	struct rcu_head	rcu_head;
>  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
>  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> +#ifdef CONFIG_BLK_CGROUP
> +	unsigned short  iohint[RADIX_TREE_MAP_SIZE];
> +#endif
>  };
>  
>  struct radix_tree_path {
> @@ -473,6 +477,8 @@ void *radix_tree_tag_set(struct radix_tr
>  		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
>  		if (!tag_get(slot, tag, offset))
>  			tag_set(slot, tag, offset);
> +		if (height == 1 && slot && tag == PAGECACHE_TAG_DIRTY)
> +			blkio_record_hint(&slot->iohint[offset]);
>  		slot = slot->slots[offset];
>  		BUG_ON(slot == NULL);
>  		shift -= RADIX_TREE_MAP_SHIFT;
> @@ -1418,3 +1425,38 @@ void __init radix_tree_init(void)
>  	radix_tree_init_maxindex();
>  	hotcpu_notifier(radix_tree_callback, 0);
>  }
> +
> +#ifdef CONFIG_BLK_CGROUP
> +
> +unsigned short radix_tree_lookup_iohint(struct radix_tree_root *root,
> +				int index)
> +{
> +	unsigned int height, shift;
> +	struct radix_tree_node *node;
> +
> +	node = rcu_redereference(root->rnode);
> +	if (node == NULL)
> +		return 0;
> +	if (!radix_tree_is_indirect_ptr(node))
> +		return root->iohint;
> +	node = radxi_tree_indirect_to_ptr(node);
> +
> +	height = node->height;
> +	if (index > radix_tree_maxindex(height))
> +		return 0;
> +	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> +	for ( ; ; ) {
> +		int offset;
> +
> +		if (node == NULL)
> +			return 0;
> +		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
> +		if (height == 1)
> +			return node->iohint[offset];
> +		node = rcu_rereference(node->slots[offset]);
> +		shift -= RADIX_TREE_MAP_SHIFT;
> +		height--;
> +	}
> +}
> +EXPORT_SYMBOL(radix_tree_lookup_iohint);
> +#endif

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-11  2:47       ` Vivek Goyal
@ 2011-03-11 16:07         ` Justin TerAvest
  2011-03-11 16:39           ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Justin TerAvest @ 2011-03-11 16:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Thu, Mar 10, 2011 at 6:47 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Mar 09, 2011 at 10:04:11AM -0800, Justin TerAvest wrote:
>> On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
>> >> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
>> >> > This patchset adds tracking to the page_cgroup structure for which cgroup has
>> >> > dirtied a page, and uses that information to provide isolation between
>> >> > cgroups performing writeback.
>> >> >
>> >>
>> >> Justin,
>> >>
>> >> So if somebody is trying to isolate a workload which does bunch of READS
>> >> and lots of buffered WRITES, this patchset should help in the sense that
>> >> all the heavy WRITES can be put into a separate cgroup of low weight?
>> >>
>> >> Other application which are primarily doing READS, direct WRITES or little
>> >> bit of buffered WRITES should still get good latencies if heavy writer
>> >> is isolated in a separate group?
>> >>
>> >> If yes, then this piece standalone can make sense. And once the other
>> >> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
>> >> writeout come in, then one will be able to differentiate buffered writes
>> >> of different groups.
>> >
>> > Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
>> > question would be will it help me enable get better isolation latencies
>> > of READS agains buffered WRITES?
>>
>> Ah! Sorry, I left out a patch that disables cross-group preemption.
>> I'll add that to the patchset and email out v2 soon.
>
> Well, what I was referring to that even in current code sync preempts
> all async in CFQ. So it looks like this patchset will not help get
> better latencies in presence of WRITES?

Hi Vivek,

I should have been more clear. I forgot to include a patch that
changes the behavior of that preemption. I haven't mailed out v2 yet
because I was also writing a change to put the css_id in pc->flags
instead of its own field.

The preemption change would look like:

    Previously, a sync queue in can preempt an async queue in another
    cgroup. This changes that behavior to disallow such preemption.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ab7a216..0494c0c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3390,6 +3390,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
        if (!cfqq)
                return false;

+       if (new_cfqq->cfqg != cfqq->cfqg)
+               return false;
+
        if (cfq_class_idle(new_cfqq))
                return false;

@@ -3409,9 +3412,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
        if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
                return true;

-       if (new_cfqq->cfqg != cfqq->cfqg)
-               return false;

I will include the test results that show that isolation is also
improved between a reader and a buffered writer.

Thanks,
Justin


>
> The only place it can help is that one is looking for service differentation
> between two or more buffered write streams. For that we need to fix
> upper layers first.
>
> Vivek
>

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-11 16:07         ` Justin TerAvest
@ 2011-03-11 16:39           ` Vivek Goyal
  2011-03-15 16:41             ` Justin TerAvest
  0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2011-03-11 16:39 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Fri, Mar 11, 2011 at 08:07:17AM -0800, Justin TerAvest wrote:
> On Thu, Mar 10, 2011 at 6:47 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Wed, Mar 09, 2011 at 10:04:11AM -0800, Justin TerAvest wrote:
> >> On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
> >> >> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
> >> >> > This patchset adds tracking to the page_cgroup structure for which cgroup has
> >> >> > dirtied a page, and uses that information to provide isolation between
> >> >> > cgroups performing writeback.
> >> >> >
> >> >>
> >> >> Justin,
> >> >>
> >> >> So if somebody is trying to isolate a workload which does bunch of READS
> >> >> and lots of buffered WRITES, this patchset should help in the sense that
> >> >> all the heavy WRITES can be put into a separate cgroup of low weight?
> >> >>
> >> >> Other application which are primarily doing READS, direct WRITES or little
> >> >> bit of buffered WRITES should still get good latencies if heavy writer
> >> >> is isolated in a separate group?
> >> >>
> >> >> If yes, then this piece standalone can make sense. And once the other
> >> >> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
> >> >> writeout come in, then one will be able to differentiate buffered writes
> >> >> of different groups.
> >> >
> >> > Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
> >> > question would be will it help me enable get better isolation latencies
> >> > of READS agains buffered WRITES?
> >>
> >> Ah! Sorry, I left out a patch that disables cross-group preemption.
> >> I'll add that to the patchset and email out v2 soon.
> >
> > Well, what I was referring to that even in current code sync preempts
> > all async in CFQ. So it looks like this patchset will not help get
> > better latencies in presence of WRITES?
> 
> Hi Vivek,
> 
> I should have been more clear. I forgot to include a patch that
> changes the behavior of that preemption. I haven't mailed out v2 yet
> because I was also writing a change to put the css_id in pc->flags
> instead of its own field.
> 
> The preemption change would look like:
> 
>     Previously, a sync queue in can preempt an async queue in another
>     cgroup. This changes that behavior to disallow such preemption.
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index ab7a216..0494c0c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3390,6 +3390,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
>         if (!cfqq)
>                 return false;
> 
> +       if (new_cfqq->cfqg != cfqq->cfqg)
> +               return false;
> +
>         if (cfq_class_idle(new_cfqq))
>                 return false;
> 
> @@ -3409,9 +3412,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
>         if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>                 return true;
> 
> -       if (new_cfqq->cfqg != cfqq->cfqg)
> -               return false;
> 
> I will include the test results that show that isolation is also
> improved between a reader and a buffered writer.
> 

Justin,

Right now all the async queues go in one cgroup and that is root cgroup.
So it is perfectly fine to let sync preempt async.

I will be interested in seeing the results. But do you have a theoritical
explanation also that why with this patch set we will get better isolation
between READS and WRITES?

Especially after following patch from Shaohua Li.

commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Fri Jan 14 08:41:02 2011 +0100

    block cfq: make queue preempt work for queues from different workload


This patch will make sure that sync always preempt async. So I am not
able to understand that how this patch series provides better latencies
for READS in presence of WRITES.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-11 16:39           ` Vivek Goyal
@ 2011-03-15 16:41             ` Justin TerAvest
  2011-03-15 18:31               ` Vivek Goyal
  0 siblings, 1 reply; 33+ messages in thread
From: Justin TerAvest @ 2011-03-15 16:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Fri, Mar 11, 2011 at 8:39 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Mar 11, 2011 at 08:07:17AM -0800, Justin TerAvest wrote:
>> On Thu, Mar 10, 2011 at 6:47 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Wed, Mar 09, 2011 at 10:04:11AM -0800, Justin TerAvest wrote:
>> >> On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
>> >> >> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
>> >> >> > This patchset adds tracking to the page_cgroup structure for which cgroup has
>> >> >> > dirtied a page, and uses that information to provide isolation between
>> >> >> > cgroups performing writeback.
>> >> >> >
>> >> >>
>> >> >> Justin,
>> >> >>
>> >> >> So if somebody is trying to isolate a workload which does bunch of READS
>> >> >> and lots of buffered WRITES, this patchset should help in the sense that
>> >> >> all the heavy WRITES can be put into a separate cgroup of low weight?
>> >> >>
>> >> >> Other application which are primarily doing READS, direct WRITES or little
>> >> >> bit of buffered WRITES should still get good latencies if heavy writer
>> >> >> is isolated in a separate group?
>> >> >>
>> >> >> If yes, then this piece standalone can make sense. And once the other
>> >> >> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
>> >> >> writeout come in, then one will be able to differentiate buffered writes
>> >> >> of different groups.
>> >> >
>> >> > Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
>> >> > question would be will it help me enable get better isolation latencies
>> >> > of READS agains buffered WRITES?
>> >>
>> >> Ah! Sorry, I left out a patch that disables cross-group preemption.
>> >> I'll add that to the patchset and email out v2 soon.
>> >
>> > Well, what I was referring to that even in current code sync preempts
>> > all async in CFQ. So it looks like this patchset will not help get
>> > better latencies in presence of WRITES?
>>
>> Hi Vivek,
>>
>> I should have been more clear. I forgot to include a patch that
>> changes the behavior of that preemption. I haven't mailed out v2 yet
>> because I was also writing a change to put the css_id in pc->flags
>> instead of its own field.
>>
>> The preemption change would look like:
>>
>>     Previously, a sync queue in can preempt an async queue in another
>>     cgroup. This changes that behavior to disallow such preemption.
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index ab7a216..0494c0c 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -3390,6 +3390,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
>>         if (!cfqq)
>>                 return false;
>>
>> +       if (new_cfqq->cfqg != cfqq->cfqg)
>> +               return false;
>> +
>>         if (cfq_class_idle(new_cfqq))
>>                 return false;
>>
>> @@ -3409,9 +3412,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
>>         if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>>                 return true;
>>
>> -       if (new_cfqq->cfqg != cfqq->cfqg)
>> -               return false;
>>
>> I will include the test results that show that isolation is also
>> improved between a reader and a buffered writer.
>>
>
> Justin,
>
> Right now all the async queues go in one cgroup and that is root cgroup.
> So it is perfectly fine to let sync preempt async.
>
> I will be interested in seeing the results. But do you have a theoritical
> explanation also that why with this patch set we will get better isolation
> between READS and WRITES?

Hi Vivek,

Some of the change is probably because the patch set also changes
preemption semantics. Because now a group will be charged for the
buffered WRITES it is causing, it will give less time than before to a
task that is doing lots of buffered writes. I feel like I am still not
understanding something you are saying, but it will probably make more
sense once I have supporting data.

>
> Especially after following patch from Shaohua Li.
>
> commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
> Author: Shaohua Li <shaohua.li@intel.com>
> Date:   Fri Jan 14 08:41:02 2011 +0100
>
>    block cfq: make queue preempt work for queues from different workload
>
>
> This patch will make sure that sync always preempt async. So I am not
> able to understand that how this patch series provides better latencies
> for READS in presence of WRITES.

I have primarily been measuring isolation between READS and WRITES. Do
you have any tools you'd recommend for getting data on when latencies
are better?


Sorry for the delay in mailing v2; I'm debugging a problem where I see
much better isolation with ext2 than ext4.

Thanks,
Justin

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.
  2011-03-15 16:41             ` Justin TerAvest
@ 2011-03-15 18:31               ` Vivek Goyal
  0 siblings, 0 replies; 33+ messages in thread
From: Vivek Goyal @ 2011-03-15 18:31 UTC (permalink / raw)
  To: Justin TerAvest
  Cc: m-ikeda, jaxboe, linux-kernel, ryov, taka, kamezawa.hiroyu,
	righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin

On Tue, Mar 15, 2011 at 09:41:50AM -0700, Justin TerAvest wrote:
> On Fri, Mar 11, 2011 at 8:39 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Mar 11, 2011 at 08:07:17AM -0800, Justin TerAvest wrote:
> >> On Thu, Mar 10, 2011 at 6:47 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Wed, Mar 09, 2011 at 10:04:11AM -0800, Justin TerAvest wrote:
> >> >> On Tue, Mar 8, 2011 at 2:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> >> > On Tue, Mar 08, 2011 at 05:43:25PM -0500, Vivek Goyal wrote:
> >> >> >> On Tue, Mar 08, 2011 at 01:20:50PM -0800, Justin TerAvest wrote:
> >> >> >> > This patchset adds tracking to the page_cgroup structure for which cgroup has
> >> >> >> > dirtied a page, and uses that information to provide isolation between
> >> >> >> > cgroups performing writeback.
> >> >> >> >
> >> >> >>
> >> >> >> Justin,
> >> >> >>
> >> >> >> So if somebody is trying to isolate a workload which does bunch of READS
> >> >> >> and lots of buffered WRITES, this patchset should help in the sense that
> >> >> >> all the heavy WRITES can be put into a separate cgroup of low weight?
> >> >> >>
> >> >> >> Other application which are primarily doing READS, direct WRITES or little
> >> >> >> bit of buffered WRITES should still get good latencies if heavy writer
> >> >> >> is isolated in a separate group?
> >> >> >>
> >> >> >> If yes, then this piece standalone can make sense. And once the other
> >> >> >> piece/patches of memory cgroup dirty ratio and cgroup aware buffered
> >> >> >> writeout come in, then one will be able to differentiate buffered writes
> >> >> >> of different groups.
> >> >> >
> >> >> > Thinking more about it, currently anyway SYNC preempts the ASYNC. So the
> >> >> > question would be will it help me enable get better isolation latencies
> >> >> > of READS agains buffered WRITES?
> >> >>
> >> >> Ah! Sorry, I left out a patch that disables cross-group preemption.
> >> >> I'll add that to the patchset and email out v2 soon.
> >> >
> >> > Well, what I was referring to that even in current code sync preempts
> >> > all async in CFQ. So it looks like this patchset will not help get
> >> > better latencies in presence of WRITES?
> >>
> >> Hi Vivek,
> >>
> >> I should have been more clear. I forgot to include a patch that
> >> changes the behavior of that preemption. I haven't mailed out v2 yet
> >> because I was also writing a change to put the css_id in pc->flags
> >> instead of its own field.
> >>
> >> The preemption change would look like:
> >>
> >>     Previously, a sync queue in can preempt an async queue in another
> >>     cgroup. This changes that behavior to disallow such preemption.
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index ab7a216..0494c0c 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -3390,6 +3390,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
> >>         if (!cfqq)
> >>                 return false;
> >>
> >> +       if (new_cfqq->cfqg != cfqq->cfqg)
> >> +               return false;
> >> +
> >>         if (cfq_class_idle(new_cfqq))
> >>                 return false;
> >>
> >> @@ -3409,9 +3412,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue
> >>         if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> >>                 return true;
> >>
> >> -       if (new_cfqq->cfqg != cfqq->cfqg)
> >> -               return false;
> >>
> >> I will include the test results that show that isolation is also
> >> improved between a reader and a buffered writer.
> >>
> >
> > Justin,
> >
> > Right now all the async queues go in one cgroup and that is root cgroup.
> > So it is perfectly fine to let sync preempt async.
> >
> > I will be interested in seeing the results. But do you have a theoritical
> > explanation also that why with this patch set we will get better isolation
> > between READS and WRITES?
> 
> Hi Vivek,
> 
> Some of the change is probably because the patch set also changes
> preemption semantics. Because now a group will be charged for the
> buffered WRITES it is causing, it will give less time than before to a
> task that is doing lots of buffered writes. I feel like I am still not
> understanding something you are saying, but it will probably make more
> sense once I have supporting data.
> 
> >
> > Especially after following patch from Shaohua Li.
> >
> > commit f8ae6e3eb8251be32c6e913393d9f8d9e0609489
> > Author: Shaohua Li <shaohua.li@intel.com>
> > Date:   Fri Jan 14 08:41:02 2011 +0100
> >
> >    block cfq: make queue preempt work for queues from different workload
> >
> >
> > This patch will make sure that sync always preempt async. So I am not
> > able to understand that how this patch series provides better latencies
> > for READS in presence of WRITES.
> 
> I have primarily been measuring isolation between READS and WRITES. Do
> you have any tools you'd recommend for getting data on when latencies
> are better?

I like fio. Just launch bunch of random readers and start few writebacks
and notice max completion latency of reads. We can compare the results
with Jen's tree vs your changes on top.

Keep all READS and WRITES in root group (without your patches). With your
patches move WRITES in a separate cgroup and see if you can get better
latencies for READS. So this is WRITES being root group vs WRITES being
in a separate cgroup of low weight.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2011-03-15 18:31 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-08 21:20 [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Justin TerAvest
2011-03-08 21:20 ` [PATCH 1/6] Add IO cgroup tracking " Justin TerAvest
2011-03-08 21:20 ` [PATCH 2/6] Make async queues per cgroup Justin TerAvest
2011-03-08 21:20 ` [PATCH 3/6] Modify CFQ to use IO tracking information Justin TerAvest
2011-03-08 21:20 ` [PATCH 4/6] With per-cgroup async, don't special case queues Justin TerAvest
2011-03-08 21:20 ` [PATCH 5/6] Add stat for per cgroup writeout done by flusher Justin TerAvest
2011-03-08 21:20 ` [PATCH 6/6] Per cgroup request descriptor counts Justin TerAvest
2011-03-08 22:43 ` [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes Vivek Goyal
2011-03-08 22:50   ` Vivek Goyal
2011-03-09 18:04     ` Justin TerAvest
2011-03-11  2:47       ` Vivek Goyal
2011-03-11 16:07         ` Justin TerAvest
2011-03-11 16:39           ` Vivek Goyal
2011-03-15 16:41             ` Justin TerAvest
2011-03-15 18:31               ` Vivek Goyal
2011-03-09  5:22 ` KAMEZAWA Hiroyuki
2011-03-09 15:19   ` Vivek Goyal
2011-03-09 18:05   ` Justin TerAvest
2011-03-10 18:08   ` Justin TerAvest
2011-03-10 18:15     ` Vivek Goyal
2011-03-10 18:57       ` Justin TerAvest
2011-03-10 19:11         ` [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.) Vivek Goyal
2011-03-10 19:41           ` Vivek Goyal
2011-03-10 21:15             ` Chris Mason
2011-03-10 21:24               ` Andreas Dilger
2011-03-10 21:38                 ` Vivek Goyal
2011-03-10 21:43                   ` Chris Mason
2011-03-11  1:20                     ` KAMEZAWA Hiroyuki
2011-03-11  1:46                     ` Dave Chinner
2011-03-11  2:15                       ` Vivek Goyal
2011-03-11  2:52                         ` KAMEZAWA Hiroyuki
2011-03-11  3:15                           ` Vivek Goyal
2011-03-11  3:13                             ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).