* [PATCHSET] block: implement per-blkg request allocation
@ 2012-04-26 21:59 ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hello,

Currently, the block layer shares a single request_list (@q->rq) among
all IOs regardless of their blkcg associations.  This means that once
the shared pool is exhausted, blkcg limits don't mean much: whoever
grabs a freed request first gets the next IO slot.

This priority inversion can easily be demonstrated by creating a blkio
cgroup with a very low weight, putting a program which issues a lot of
random direct IOs in it, and running a sequential IO from a different
cgroup.  As soon as the request pool is used up, the sequential IO
bandwidth crashes.

This patchset implements per-blkg request allocation so that each
blkcg-request_queue pair has its own request pool to allocate from.
This isolates different blkcgs in terms of request allocation.
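The isolation property can be sketched in userspace C.  This is a
minimal model, not the kernel API: the counter-based pool, the struct
layout and the function names are all illustrative.

```c
#include <assert.h>

/*
 * Sketch: instead of one shared pool of request slots, each
 * blkcg/request_queue pair gets its own request_list with a private
 * budget, so one group exhausting its slots cannot starve another.
 */
#define POOL_SIZE 4

struct request_list {
    int count;  /* slots currently allocated */
    int max;    /* per-group budget */
};

/* Try to take a request slot from @rl; returns 1 on success. */
int get_request(struct request_list *rl)
{
    if (rl->count >= rl->max)
        return 0;   /* only this group is throttled */
    rl->count++;
    return 1;
}

void put_request(struct request_list *rl)
{
    rl->count--;
}

/*
 * Demonstrate isolation: drain group A completely; group B still
 * allocates.  With a single shared pool, both would fail.
 */
int demo_isolation(void)
{
    struct request_list a = { 0, POOL_SIZE };
    struct request_list b = { 0, POOL_SIZE };
    int i;

    for (i = 0; i < POOL_SIZE; i++)
        assert(get_request(&a));
    if (get_request(&a))
        return 0;           /* A must be exhausted */
    return get_request(&b); /* B must still succeed */
}
```

With a shared pool, group A's random-IO flood would consume every slot
and group B's sequential IO would stall, which is exactly the
inversion described above.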

Most changes are straightforward; unfortunately, bdi isn't
blkcg-aware yet, so it currently just propagates the congestion state
from the root blkcg.  As writeback currently always happens on the
root blkcg, this mostly works for write congestion, but readahead may
behave suboptimally under congestion for now.  This needs to be
improved, but the situation is still way better than blkcg enforcement
collapsing completely.

 0001-blkcg-fix-blkg_alloc-failure-path.patch
 0002-blkcg-__blkg_lookup_create-doesn-t-have-to-fail-on-r.patch
 0003-blkcg-make-root-blkcg-allocation-use-GFP_KERNEL.patch
 0004-mempool-add-gfp_mask-to-mempool_create_node.patch
 0005-block-drop-custom-queue-draining-used-by-scsi_transp.patch
 0006-block-refactor-get_request-_wait.patch
 0007-block-allocate-io_context-upfront.patch
 0008-blkcg-inline-bio_blkcg-and-friends.patch
 0009-block-add-q-nr_rqs-and-move-q-rq.elvpriv-to-q-nr_rqs.patch
 0010-block-prepare-for-multiple-request_lists.patch
 0011-blkcg-implement-per-blkg-request-allocation.patch

0001-0003 are assorted fixes / improvements which can be separated
from this patchset; they're included in this series for convenience.

0004 adds @gfp_mask to mempool_create_node().  This is necessary
because blkg allocation is on the IO path and blkg now contains a
mempool for its request_list.  Note that blkg allocation failure
doesn't lead to catastrophic failure; it just hinders blkcg
enforcement.

0005 drops the custom queue draining, which I don't think is necessary
and which gets in the way of further updates.

0006-0010 are prep patches and 0011 implements per-blkg request
allocation.

This patchset is on top of,

  block/for-3.5/core bd1a68b59c "vmsplice: relax alignement requireme..."
+ [1] blkcg: tg_stats_alloc_lock is an irq lock

and is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blkcg-rl

Jens, I still can't reproduce the boot failure you were seeing on
block/for-3.5/core, so am just basing this series on top.  Once we
figure that one out, we can resequence the patches.

Thanks.

 block/blk-cgroup.c                  |  147 ++++++++++++++++----------
 block/blk-cgroup.h                  |  121 +++++++++++++++++++++
 block/blk-core.c                    |  200 ++++++++++++++++++------------------
 block/blk-sysfs.c                   |   34 +++---
 block/blk-throttle.c                |    3 
 block/blk.h                         |    3 
 block/bsg-lib.c                     |   53 ---------
 drivers/scsi/scsi_transport_fc.c    |   38 ------
 drivers/scsi/scsi_transport_iscsi.c |    2 
 include/linux/blkdev.h              |   53 +++++----
 include/linux/bsg-lib.h             |    1 
 include/linux/mempool.h             |    3 
 mm/mempool.c                        |   12 +-
 13 files changed, 379 insertions(+), 291 deletions(-)

--
tejun

[1] http://article.gmane.org/gmane.linux.kernel/1288400

* [PATCH 01/11] blkcg: fix blkg_alloc() failure path
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

When policy data allocation fails in the middle, blkg_alloc() invokes
blkg_free() to destroy the half-constructed blkg.  This ends up
calling pd_exit_fn() on policy data which never went through
pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
immediately after each policy data allocation.
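The bug class can be sketched in userspace C (illustrative names, not
the kernel API): when allocation and initialization run in separate
loops, a mid-allocation failure makes teardown call the exit hook on
items that were never initialized; initializing each item right after
its allocation keeps the invariant "allocated implies initialized".

```c
/*
 * Sketch of the fix: allocate-and-init in a single loop so that the
 * error path never sees an allocated-but-uninitialized item.
 */
#define NR_POLS 4

struct pd {
    int allocated;
    int initialized;
};

int exit_on_uninit;     /* counts the buggy condition */

void pd_exit(struct pd *pd)
{
    if (pd->allocated && !pd->initialized)
        exit_on_uninit++;   /* pd_exit_fn() on uninit data */
}

/*
 * Fail at index @fail_at to simulate a mid-construction allocation
 * failure, then tear everything down.
 */
int construct(struct pd *pds, int n, int fail_at)
{
    int i;

    for (i = 0; i < n; i++) {
        if (i == fail_at)
            goto err;           /* allocation failed */
        pds[i].allocated = 1;
        pds[i].initialized = 1; /* init immediately after alloc */
    }
    return 0;
err:
    for (i = 0; i < n; i++)
        pd_exit(&pds[i]);
    return -1;
}

/* Returns how many times the exit hook ran on uninitialized data. */
int demo_alloc_failure(void)
{
    struct pd pds[NR_POLS] = { { 0, 0 } };

    exit_on_uninit = 0;
    construct(pds, NR_POLS, 2);
    return exit_on_uninit;
}
```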

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02cf633..4ab7420 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -125,12 +125,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-	}
-
-	/* invoke per-policy init */
-	for (i = 0; i < BLKCG_MAX_POLS; i++) {
-		struct blkcg_policy *pol = blkcg_policy[i];
 
+		/* invoke per-policy init */
 		if (blkcg_policy_enabled(blkg->q, pol))
 			pol->pd_init_fn(blkg);
 	}
-- 
1.7.7.3

* [PATCH 02/11] blkcg: __blkg_lookup_create() doesn't have to fail on radix tree preload failure
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

__blkg_lookup_create() currently fails if radix_tree_preload() fails;
however, preload failure doesn't imply insertion failure.  Don't fail
__blkg_lookup_create() on preload failure.

While at it, drop sparse locking annotation which no longer applies.
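The pattern can be sketched in userspace C (illustrative names; the
kernel's radix tree preload semantics are only modeled): preload is an
optimization that reserves memory so a later insertion can't fail for
lack of it, so when preload itself fails, the insertion should still
be attempted, and preload_end() must only run if the preload actually
happened.

```c
/*
 * Sketch: record whether preload succeeded in a bool and pair
 * preload_end() with it, instead of aborting on preload failure.
 */
int preload_should_fail;
int preload_active;

int my_preload(void)    /* returns 0 on success, like the kernel */
{
    if (preload_should_fail)
        return -1;
    preload_active = 1;
    return 0;
}

void my_preload_end(void)
{
    preload_active = 0;
}

int my_insert(void)
{
    return 0;   /* insertion can succeed without a preload */
}

/* The fixed flow: don't fail just because preload failed. */
int lookup_create(void)
{
    int preloaded = !my_preload();
    int ret = my_insert();

    if (preloaded)
        my_preload_end();   /* never paired with a failed preload */
    return ret;
}

int demo_preload_failure(void)
{
    preload_should_fail = 1;
    if (lookup_create() != 0)
        return -1;          /* must not fail on preload failure */
    return preload_active;  /* and no dangling preload state */
}
```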

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c |   13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4ab7420..197fb50 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -177,9 +177,9 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
 
 static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 					     struct request_queue *q)
-	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct blkcg_gq *blkg;
+	bool preloaded;
 	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -203,9 +203,7 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 		goto err_put;
 
 	/* insert */
-	ret = radix_tree_preload(GFP_ATOMIC);
-	if (ret)
-		goto err_free;
+	preloaded = !radix_tree_preload(GFP_ATOMIC);
 
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
@@ -215,14 +213,13 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	}
 	spin_unlock(&blkcg->lock);
 
-	radix_tree_preload_end();
-
+	if (preloaded)
+		radix_tree_preload_end();
 	if (!ret)
 		return blkg;
-err_free:
-	blkg_free(blkg);
 err_put:
 	css_put(&blkcg->css);
+	blkg_free(blkg);
 	return ERR_PTR(ret);
 }
 
-- 
1.7.7.3

* [PATCH 03/11] blkcg: make root blkcg allocation use %GFP_KERNEL
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation.  This could make
policy activation fail unnecessarily.

Make blkg_alloc() take @gfp_mask, make __blkg_lookup_create() take an
optional @new_blkg for a preallocated blkg, and make
blkcg_activate_policy() preload the radix tree and preallocate the
blkg with %GFP_KERNEL before trying to create the root blkg.
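The preallocation pattern can be sketched in userspace C (illustrative
names; malloc stands in for the two GFP contexts): do the allocation
that may sleep before entering the atomic section, hand the
preallocated object in, and fall back to a non-blocking allocation
only when no preallocated object was supplied.

```c
#include <stdlib.h>

/* Sketch: an object tagged with the context it was allocated from. */
struct obj {
    int from_atomic;
};

struct obj *alloc_obj(int atomic)
{
    struct obj *o = malloc(sizeof(*o));

    if (o)
        o->from_atomic = atomic;
    return o;
}

/*
 * Inside the "atomic" section: use @prealloc if given; otherwise fall
 * back to an atomic allocation, which may fail unnecessarily.
 */
struct obj *lookup_create(struct obj *prealloc)
{
    if (!prealloc)
        prealloc = alloc_obj(1);    /* GFP_ATOMIC-like fallback */
    return prealloc;
}

/* Preallocate in sleepable context, consume in the atomic path. */
int demo_prealloc(void)
{
    struct obj *o = alloc_obj(0);   /* GFP_KERNEL-like, may sleep */
    struct obj *got = lookup_create(o);
    int ok = (got == o) && !got->from_atomic;

    free(got);
    return ok;
}
```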

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c |   58 +++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 197fb50..a8f2f03 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -91,16 +91,18 @@ static void blkg_free(struct blkcg_gq *blkg)
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
  * @q: request_queue the new blkg is associated with
+ * @gfp_mask: allocation mask to use
  *
  * Allocate a new blkg assocating @blkcg and @q.
  */
-static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
+static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
+				   gfp_t gfp_mask)
 {
 	struct blkcg_gq *blkg;
 	int i;
 
 	/* alloc and init base part */
-	blkg = kzalloc_node(sizeof(*blkg), GFP_ATOMIC, q->node);
+	blkg = kzalloc_node(sizeof(*blkg), gfp_mask, q->node);
 	if (!blkg)
 		return NULL;
 
@@ -117,7 +119,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
 			continue;
 
 		/* alloc per-policy data and attach it to blkg */
-		pd = kzalloc_node(pol->pd_size, GFP_ATOMIC, q->node);
+		pd = kzalloc_node(pol->pd_size, gfp_mask, q->node);
 		if (!pd) {
 			blkg_free(blkg);
 			return NULL;
@@ -175,8 +177,13 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
 
+/*
+ * If @new_blkg is %NULL, this function tries to allocate a new one as
+ * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
+ */
 static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q)
+					     struct request_queue *q,
+					     struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
 	bool preloaded;
@@ -189,18 +196,23 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	blkg = __blkg_lookup(blkcg, q);
 	if (blkg) {
 		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		return blkg;
+		goto out_free;
 	}
 
 	/* blkg holds a reference to blkcg */
-	if (!css_tryget(&blkcg->css))
-		return ERR_PTR(-EINVAL);
+	if (!css_tryget(&blkcg->css)) {
+		blkg = ERR_PTR(-EINVAL);
+		goto out_free;
+	}
 
 	/* allocate */
-	ret = -ENOMEM;
-	blkg = blkg_alloc(blkcg, q);
-	if (unlikely(!blkg))
-		goto err_put;
+	if (!new_blkg) {
+		ret = -ENOMEM;
+		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
+		if (unlikely(!new_blkg))
+			goto out_put;
+	}
+	blkg = new_blkg;
 
 	/* insert */
 	preloaded = !radix_tree_preload(GFP_ATOMIC);
@@ -217,10 +229,13 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 		radix_tree_preload_end();
 	if (!ret)
 		return blkg;
-err_put:
+
+	blkg = ERR_PTR(ret);
+out_put:
 	css_put(&blkcg->css);
-	blkg_free(blkg);
-	return ERR_PTR(ret);
+out_free:
+	blkg_free(new_blkg);
+	return blkg;
 }
 
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@@ -232,7 +247,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q);
+	return __blkg_lookup_create(blkcg, q, NULL);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -732,19 +747,30 @@ int blkcg_activate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
+	bool preloaded;
 
 	if (blkcg_policy_enabled(q, pol))
 		return 0;
 
+	/* preallocations for root blkg */
+	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!blkg)
+		return -ENOMEM;
+
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	blk_queue_bypass_start(q);
 
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q);
+	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
 	rcu_read_unlock();
 
+	if (preloaded)
+		radix_tree_preload_end();
+
 	if (IS_ERR(blkg)) {
 		ret = PTR_ERR(blkg);
 		goto out_unlock;
-- 
1.7.7.3

* [PATCH 04/11] mempool: add @gfp_mask to mempool_create_node()
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags; add a @gfp_mask argument to the function.
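The shape of the change can be sketched in userspace C.  This is a
simplified model, not the kernel mempool API: pool creation takes an
allocation-context flag instead of hard-coding may-sleep behaviour,
and remembers it for later internal allocations.

```c
#include <stdlib.h>

/* Stand-ins for the GFP flags (illustrative only). */
enum my_gfp { MY_GFP_KERNEL, MY_GFP_ATOMIC };

struct pool {
    void **elements;
    int nr;
    enum my_gfp gfp_mask;   /* remembered for internal allocations */
};

/*
 * Create a pool of @min_nr elements of @size bytes, using @gfp_mask
 * for every internal allocation (previously implicitly GFP_KERNEL).
 * Cleanup on partial failure is elided to keep the sketch short.
 */
struct pool *pool_create(int min_nr, size_t size, enum my_gfp gfp_mask)
{
    struct pool *p = calloc(1, sizeof(*p));
    int i;

    if (!p)
        return NULL;
    p->elements = calloc(min_nr, sizeof(void *));
    if (!p->elements) {
        free(p);
        return NULL;
    }
    p->gfp_mask = gfp_mask;
    for (i = 0; i < min_nr; i++) {
        p->elements[i] = malloc(size);
        if (!p->elements[i])
            break;
        p->nr++;
    }
    return p;
}

/* Returns 1 when an "atomic" pool came up fully populated. */
int demo_pool(void)
{
    struct pool *p = pool_create(3, 16, MY_GFP_ATOMIC);

    return p && p->nr == 3 && p->gfp_mask == MY_GFP_ATOMIC;
}
```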

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c        |    4 ++--
 include/linux/mempool.h |    3 ++-
 mm/mempool.c            |   12 +++++++-----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6cf13df..6a04dcd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -518,8 +518,8 @@ static int blk_init_free_list(struct request_queue *q)
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
 
 	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
-
+					  mempool_free_slab, request_cachep,
+					  GFP_KERNEL, q->node);
 	if (!rl->rq_pool)
 		return -ENOMEM;
 
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 7c08052..39ed62a 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -26,7 +26,8 @@ typedef struct mempool_s {
 extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 			mempool_free_t *free_fn, void *pool_data);
 extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
-			mempool_free_t *free_fn, void *pool_data, int nid);
+			mempool_free_t *free_fn, void *pool_data,
+			gfp_t gfp_mask, int nid);
 
 extern int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask);
 extern void mempool_destroy(mempool_t *pool);
diff --git a/mm/mempool.c b/mm/mempool.c
index d904981..5499047 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -63,19 +63,21 @@ EXPORT_SYMBOL(mempool_destroy);
 mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 				mempool_free_t *free_fn, void *pool_data)
 {
-	return  mempool_create_node(min_nr,alloc_fn,free_fn, pool_data,-1);
+	return mempool_create_node(min_nr,alloc_fn,free_fn, pool_data,
+				   GFP_KERNEL, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(mempool_create);
 
 mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
-			mempool_free_t *free_fn, void *pool_data, int node_id)
+			       mempool_free_t *free_fn, void *pool_data,
+			       gfp_t gfp_mask, int node_id)
 {
 	mempool_t *pool;
-	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL | __GFP_ZERO, node_id);
+	pool = kmalloc_node(sizeof(*pool), gfp_mask | __GFP_ZERO, node_id);
 	if (!pool)
 		return NULL;
 	pool->elements = kmalloc_node(min_nr * sizeof(void *),
-					GFP_KERNEL, node_id);
+				      gfp_mask, node_id);
 	if (!pool->elements) {
 		kfree(pool);
 		return NULL;
@@ -93,7 +95,7 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 	while (pool->curr_nr < pool->min_nr) {
 		void *element;
 
-		element = pool->alloc(GFP_KERNEL, pool->pool_data);
+		element = pool->alloc(gfp_mask, pool->pool_data);
 		if (unlikely(!element)) {
 			mempool_destroy(pool);
 			return NULL;
-- 
1.7.7.3

* [PATCH 05/11] block: drop custom queue draining used by scsi_transport_{iscsi|fc}
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	James Smart,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mike Christie, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining.  fc_bsg_remove() open-codes mostly identical logic.

The draining logic isn't correct: blk_stop_queue() doesn't prevent
new requests from being queued - it only stops request processing -
so nothing prevents new requests from being queued after the logic
determines that the queue is drained.

blk_cleanup_queue() now implements proper queue draining, so these
custom draining implementations aren't necessary.  Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>
Cc: Mike Christie <michaelc-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>
Cc: James Smart <james.smart-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
---
 block/bsg-lib.c                     |   53 -----------------------------------
 drivers/scsi/scsi_transport_fc.c    |   38 -------------------------
 drivers/scsi/scsi_transport_iscsi.c |    2 +-
 include/linux/bsg-lib.h             |    1 -
 4 files changed, 1 insertions(+), 93 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 7ad49c8..deee61f 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -243,56 +243,3 @@ int bsg_setup_queue(struct device *dev, struct request_queue *q,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(bsg_setup_queue);
-
-/**
- * bsg_remove_queue - Deletes the bsg dev from the q
- * @q:	the request_queue that is to be torn down.
- *
- * Notes:
- *   Before unregistering the queue empty any requests that are blocked
- */
-void bsg_remove_queue(struct request_queue *q)
-{
-	struct request *req; /* block request */
-	int counts; /* totals for request_list count and starved */
-
-	if (!q)
-		return;
-
-	/* Stop taking in new requests */
-	spin_lock_irq(q->queue_lock);
-	blk_stop_queue(q);
-
-	/* drain all requests in the queue */
-	while (1) {
-		/* need the lock to fetch a request
-		 * this may fetch the same reqeust as the previous pass
-		 */
-		req = blk_fetch_request(q);
-		/* save requests in use and starved */
-		counts = q->rq.count[0] + q->rq.count[1] +
-			 q->rq.starved[0] + q->rq.starved[1];
-		spin_unlock_irq(q->queue_lock);
-		/* any requests still outstanding? */
-		if (counts == 0)
-			break;
-
-		/* This may be the same req as the previous iteration,
-		 * always send the blk_end_request_all after a prefetch.
-		 * It is not okay to not end the request because the
-		 * prefetch started the request.
-		 */
-		if (req) {
-			/* return -ENXIO to indicate that this queue is
-			 * going away
-			 */
-			req->errors = -ENXIO;
-			blk_end_request_all(req, -ENXIO);
-		}
-
-		msleep(200); /* allow bsg to possibly finish */
-		spin_lock_irq(q->queue_lock);
-	}
-	bsg_unregister_queue(q);
-}
-EXPORT_SYMBOL_GPL(bsg_remove_queue);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 80fbe2a..3aff3db 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -4126,45 +4126,7 @@ fc_bsg_rportadd(struct Scsi_Host *shost, struct fc_rport *rport)
 static void
 fc_bsg_remove(struct request_queue *q)
 {
-	struct request *req; /* block request */
-	int counts; /* totals for request_list count and starved */
-
 	if (q) {
-		/* Stop taking in new requests */
-		spin_lock_irq(q->queue_lock);
-		blk_stop_queue(q);
-
-		/* drain all requests in the queue */
-		while (1) {
-			/* need the lock to fetch a request
-			 * this may fetch the same reqeust as the previous pass
-			 */
-			req = blk_fetch_request(q);
-			/* save requests in use and starved */
-			counts = q->rq.count[0] + q->rq.count[1] +
-				q->rq.starved[0] + q->rq.starved[1];
-			spin_unlock_irq(q->queue_lock);
-			/* any requests still outstanding? */
-			if (counts == 0)
-				break;
-
-			/* This may be the same req as the previous iteration,
-			 * always send the blk_end_request_all after a prefetch.
-			 * It is not okay to not end the request because the
-			 * prefetch started the request.
-			 */
-			if (req) {
-				/* return -ENXIO to indicate that this queue is
-				 * going away
-				 */
-				req->errors = -ENXIO;
-				blk_end_request_all(req, -ENXIO);
-			}
-
-			msleep(200); /* allow bsg to possibly finish */
-			spin_lock_irq(q->queue_lock);
-		}
-
 		bsg_unregister_queue(q);
 		blk_cleanup_queue(q);
 	}
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 1cf640e..c737a16 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -575,7 +575,7 @@ static int iscsi_remove_host(struct transport_container *tc,
 	struct iscsi_cls_host *ihost = shost->shost_data;
 
 	if (ihost->bsg_q) {
-		bsg_remove_queue(ihost->bsg_q);
+		bsg_unregister_queue(ihost->bsg_q);
 		blk_cleanup_queue(ihost->bsg_q);
 	}
 	return 0;
diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h
index f55ab8c..4d0fb3d 100644
--- a/include/linux/bsg-lib.h
+++ b/include/linux/bsg-lib.h
@@ -67,7 +67,6 @@ void bsg_job_done(struct bsg_job *job, int result,
 int bsg_setup_queue(struct device *dev, struct request_queue *q, char *name,
 		    bsg_job_fn *job_fn, int dd_job_size);
 void bsg_request_fn(struct request_queue *q);
-void bsg_remove_queue(struct request_queue *q);
 void bsg_goose_queue(struct request_queue *q);
 
 #endif
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 05/11] block: drop custom queue draining used by scsi_transport_{iscsi|fc}
@ 2012-04-26 21:59     ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe
  Cc: vgoyal, ctalbott, rni, linux-kernel, cgroups, containers,
	fengguang.wu, hughd, akpm, Tejun Heo, James Bottomley,
	Mike Christie, James Smart

iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining.  fc_bsg_remove() open-codes mostly identical logic.

The draining logic isn't correct: blk_stop_queue() doesn't prevent
new requests from being queued - it only stops request processing -
so nothing prevents new requests from being queued after the logic
determines that the queue is drained.

blk_cleanup_queue() now implements proper queue draining, so these
custom draining implementations aren't necessary.  Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Smart <james.smart@emulex.com>
---
 block/bsg-lib.c                     |   53 -----------------------------------
 drivers/scsi/scsi_transport_fc.c    |   38 -------------------------
 drivers/scsi/scsi_transport_iscsi.c |    2 +-
 include/linux/bsg-lib.h             |    1 -
 4 files changed, 1 insertions(+), 93 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 7ad49c8..deee61f 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -243,56 +243,3 @@ int bsg_setup_queue(struct device *dev, struct request_queue *q,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(bsg_setup_queue);
-
-/**
- * bsg_remove_queue - Deletes the bsg dev from the q
- * @q:	the request_queue that is to be torn down.
- *
- * Notes:
- *   Before unregistering the queue empty any requests that are blocked
- */
-void bsg_remove_queue(struct request_queue *q)
-{
-	struct request *req; /* block request */
-	int counts; /* totals for request_list count and starved */
-
-	if (!q)
-		return;
-
-	/* Stop taking in new requests */
-	spin_lock_irq(q->queue_lock);
-	blk_stop_queue(q);
-
-	/* drain all requests in the queue */
-	while (1) {
-		/* need the lock to fetch a request
-		 * this may fetch the same reqeust as the previous pass
-		 */
-		req = blk_fetch_request(q);
-		/* save requests in use and starved */
-		counts = q->rq.count[0] + q->rq.count[1] +
-			 q->rq.starved[0] + q->rq.starved[1];
-		spin_unlock_irq(q->queue_lock);
-		/* any requests still outstanding? */
-		if (counts == 0)
-			break;
-
-		/* This may be the same req as the previous iteration,
-		 * always send the blk_end_request_all after a prefetch.
-		 * It is not okay to not end the request because the
-		 * prefetch started the request.
-		 */
-		if (req) {
-			/* return -ENXIO to indicate that this queue is
-			 * going away
-			 */
-			req->errors = -ENXIO;
-			blk_end_request_all(req, -ENXIO);
-		}
-
-		msleep(200); /* allow bsg to possibly finish */
-		spin_lock_irq(q->queue_lock);
-	}
-	bsg_unregister_queue(q);
-}
-EXPORT_SYMBOL_GPL(bsg_remove_queue);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 80fbe2a..3aff3db 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -4126,45 +4126,7 @@ fc_bsg_rportadd(struct Scsi_Host *shost, struct fc_rport *rport)
 static void
 fc_bsg_remove(struct request_queue *q)
 {
-	struct request *req; /* block request */
-	int counts; /* totals for request_list count and starved */
-
 	if (q) {
-		/* Stop taking in new requests */
-		spin_lock_irq(q->queue_lock);
-		blk_stop_queue(q);
-
-		/* drain all requests in the queue */
-		while (1) {
-			/* need the lock to fetch a request
-			 * this may fetch the same reqeust as the previous pass
-			 */
-			req = blk_fetch_request(q);
-			/* save requests in use and starved */
-			counts = q->rq.count[0] + q->rq.count[1] +
-				q->rq.starved[0] + q->rq.starved[1];
-			spin_unlock_irq(q->queue_lock);
-			/* any requests still outstanding? */
-			if (counts == 0)
-				break;
-
-			/* This may be the same req as the previous iteration,
-			 * always send the blk_end_request_all after a prefetch.
-			 * It is not okay to not end the request because the
-			 * prefetch started the request.
-			 */
-			if (req) {
-				/* return -ENXIO to indicate that this queue is
-				 * going away
-				 */
-				req->errors = -ENXIO;
-				blk_end_request_all(req, -ENXIO);
-			}
-
-			msleep(200); /* allow bsg to possibly finish */
-			spin_lock_irq(q->queue_lock);
-		}
-
 		bsg_unregister_queue(q);
 		blk_cleanup_queue(q);
 	}
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 1cf640e..c737a16 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -575,7 +575,7 @@ static int iscsi_remove_host(struct transport_container *tc,
 	struct iscsi_cls_host *ihost = shost->shost_data;
 
 	if (ihost->bsg_q) {
-		bsg_remove_queue(ihost->bsg_q);
+		bsg_unregister_queue(ihost->bsg_q);
 		blk_cleanup_queue(ihost->bsg_q);
 	}
 	return 0;
diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h
index f55ab8c..4d0fb3d 100644
--- a/include/linux/bsg-lib.h
+++ b/include/linux/bsg-lib.h
@@ -67,7 +67,6 @@ void bsg_job_done(struct bsg_job *job, int result,
 int bsg_setup_queue(struct device *dev, struct request_queue *q, char *name,
 		    bsg_job_fn *job_fn, int dd_job_size);
 void bsg_request_fn(struct request_queue *q);
-void bsg_remove_queue(struct request_queue *q);
 void bsg_goose_queue(struct request_queue *q);
 
 #endif
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 06/11] block: refactor get_request[_wait]()
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Currently, there are two request allocation functions - get_request()
and get_request_wait().  The former tries to allocate a request once;
the latter wraps the former and keeps retrying until allocation
succeeds.

Together, the two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation.  However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful, and in fact nobody uses it.

This patch simplifies the interface as follows.

* get_request() is renamed to __get_request() and is only used by the
  wrapper function.

* get_request_wait() is renamed to get_request().  It now takes
  @gfp_mask and retries iff it contains %__GFP_WAIT.

This patch doesn't introduce any functional change and prepares for
further updates to the request allocation path.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-core.c |   74 +++++++++++++++++++++++++----------------------------
 1 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6a04dcd..02b6cf8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -824,7 +824,7 @@ static struct io_context *rq_ioc(struct bio *bio)
 }
 
 /**
- * get_request - get a free request
+ * __get_request - get a free request
  * @q: request_queue to allocate request from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
@@ -837,8 +837,8 @@ static struct io_context *rq_ioc(struct bio *bio)
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+static struct request *__get_request(struct request_queue *q, int rw_flags,
+				     struct bio *bio, gfp_t gfp_mask)
 {
 	struct request *rq;
 	struct request_list *rl = &q->rq;
@@ -1016,56 +1016,55 @@ rq_starved:
 }
 
 /**
- * get_request_wait - get a free request with retry
+ * get_request - get a free request
  * @q: request_queue to allocate request from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
+ * @gfp_mask: allocation mask
  *
- * Get a free request from @q.  This function keeps retrying under memory
- * pressure and fails iff @q is dead.
+ * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
+ * function keeps retrying under memory pressure and fails iff @q is dead.
  *
  * Must be callled with @q->queue_lock held and,
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *get_request_wait(struct request_queue *q, int rw_flags,
-					struct bio *bio)
+static struct request *get_request(struct request_queue *q, int rw_flags,
+				   struct bio *bio, gfp_t gfp_mask)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	DEFINE_WAIT(wait);
+	struct request_list *rl = &q->rq;
 	struct request *rq;
+retry:
+	rq = __get_request(q, rw_flags, bio, gfp_mask);
+	if (rq)
+		return rq;
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
-		DEFINE_WAIT(wait);
-		struct request_list *rl = &q->rq;
-
-		if (unlikely(blk_queue_dead(q)))
-			return NULL;
-
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q)))
+		return NULL;
 
-		trace_block_sleeprq(q, bio, rw_flags & 1);
+	/* wait on @rl and retry */
+	prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+				  TASK_UNINTERRUPTIBLE);
 
-		spin_unlock_irq(q->queue_lock);
-		io_schedule();
+	trace_block_sleeprq(q, bio, rw_flags & 1);
 
-		/*
-		 * After sleeping, we become a "batching" process and
-		 * will be able to allocate at least one request, and
-		 * up to a big batch of them for a small period time.
-		 * See ioc_batching, ioc_set_batching
-		 */
-		create_io_context(GFP_NOIO, q->node);
-		ioc_set_batching(q, current->io_context);
+	spin_unlock_irq(q->queue_lock);
+	io_schedule();
 
-		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
+	/*
+	 * After sleeping, we become a "batching" process and will be able
+	 * to allocate at least one request, and up to a big batch of them
+	 * for a small period time.  See ioc_batching, ioc_set_batching
+	 */
+	create_io_context(GFP_NOIO, q->node);
+	ioc_set_batching(q, current->io_context);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	};
+	spin_lock_irq(q->queue_lock);
+	finish_wait(&rl->wait[is_sync], &wait);
 
-	return rq;
+	goto retry;
 }
 
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
@@ -1075,10 +1074,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
-	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw, NULL);
-	else
-		rq = get_request(q, rw, NULL, gfp_mask);
+	rq = get_request(q, rw, NULL, gfp_mask);
 	if (!rq)
 		spin_unlock_irq(q->queue_lock);
 	/* q->queue_lock is unlocked at this point */
@@ -1467,7 +1463,7 @@ get_rq:
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
-	req = get_request_wait(q, rw_flags, bio);
+	req = get_request(q, rw_flags, bio, GFP_NOIO);
 	if (unlikely(!req)) {
 		bio_endio(bio, -ENODEV);	/* @q is dead */
 		goto out_unlock;
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 06/11] block: refactor get_request[_wait]()
@ 2012-04-26 21:59     ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe
  Cc: vgoyal, ctalbott, rni, linux-kernel, cgroups, containers,
	fengguang.wu, hughd, akpm, Tejun Heo

Currently, there are two request allocation functions - get_request()
and get_request_wait().  The former tries to allocate a request once;
the latter wraps the former and keeps retrying until allocation
succeeds.

Together, the two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation.  However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful, and in fact nobody uses it.

This patch simplifies the interface as follows.

* get_request() is renamed to __get_request() and is only used by the
  wrapper function.

* get_request_wait() is renamed to get_request().  It now takes
  @gfp_mask and retries iff it contains %__GFP_WAIT.

This patch doesn't introduce any functional change and prepares for
further updates to the request allocation path.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c |   74 +++++++++++++++++++++++++----------------------------
 1 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6a04dcd..02b6cf8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -824,7 +824,7 @@ static struct io_context *rq_ioc(struct bio *bio)
 }
 
 /**
- * get_request - get a free request
+ * __get_request - get a free request
  * @q: request_queue to allocate request from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
@@ -837,8 +837,8 @@ static struct io_context *rq_ioc(struct bio *bio)
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+static struct request *__get_request(struct request_queue *q, int rw_flags,
+				     struct bio *bio, gfp_t gfp_mask)
 {
 	struct request *rq;
 	struct request_list *rl = &q->rq;
@@ -1016,56 +1016,55 @@ rq_starved:
 }
 
 /**
- * get_request_wait - get a free request with retry
+ * get_request - get a free request
  * @q: request_queue to allocate request from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
+ * @gfp_mask: allocation mask
  *
- * Get a free request from @q.  This function keeps retrying under memory
- * pressure and fails iff @q is dead.
+ * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
+ * function keeps retrying under memory pressure and fails iff @q is dead.
  *
  * Must be callled with @q->queue_lock held and,
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *get_request_wait(struct request_queue *q, int rw_flags,
-					struct bio *bio)
+static struct request *get_request(struct request_queue *q, int rw_flags,
+				   struct bio *bio, gfp_t gfp_mask)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	DEFINE_WAIT(wait);
+	struct request_list *rl = &q->rq;
 	struct request *rq;
+retry:
+	rq = __get_request(q, rw_flags, bio, gfp_mask);
+	if (rq)
+		return rq;
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
-		DEFINE_WAIT(wait);
-		struct request_list *rl = &q->rq;
-
-		if (unlikely(blk_queue_dead(q)))
-			return NULL;
-
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q)))
+		return NULL;
 
-		trace_block_sleeprq(q, bio, rw_flags & 1);
+	/* wait on @rl and retry */
+	prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+				  TASK_UNINTERRUPTIBLE);
 
-		spin_unlock_irq(q->queue_lock);
-		io_schedule();
+	trace_block_sleeprq(q, bio, rw_flags & 1);
 
-		/*
-		 * After sleeping, we become a "batching" process and
-		 * will be able to allocate at least one request, and
-		 * up to a big batch of them for a small period time.
-		 * See ioc_batching, ioc_set_batching
-		 */
-		create_io_context(GFP_NOIO, q->node);
-		ioc_set_batching(q, current->io_context);
+	spin_unlock_irq(q->queue_lock);
+	io_schedule();
 
-		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
+	/*
+	 * After sleeping, we become a "batching" process and will be able
+	 * to allocate at least one request, and up to a big batch of them
+	 * for a small period time.  See ioc_batching, ioc_set_batching
+	 */
+	create_io_context(GFP_NOIO, q->node);
+	ioc_set_batching(q, current->io_context);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	};
+	spin_lock_irq(q->queue_lock);
+	finish_wait(&rl->wait[is_sync], &wait);
 
-	return rq;
+	goto retry;
 }
 
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
@@ -1075,10 +1074,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
-	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw, NULL);
-	else
-		rq = get_request(q, rw, NULL, gfp_mask);
+	rq = get_request(q, rw, NULL, gfp_mask);
 	if (!rq)
 		spin_unlock_irq(q->queue_lock);
 	/* q->queue_lock is unlocked at this point */
@@ -1467,7 +1463,7 @@ get_rq:
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
-	req = get_request_wait(q, rw_flags, bio);
+	req = get_request(q, rw_flags, bio, GFP_NOIO);
 	if (unlikely(!req)) {
 		bio_endio(bio, -ENODEV);	/* @q is dead */
 		goto out_unlock;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 07/11] block: allocate io_context upfront
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

The block layer allocates ioc very lazily.  It waits until the moment
the ioc is absolutely necessary; unfortunately, that moment can be
inside the queue lock, and __get_request() performs an unlock - try
alloc - retry dance.

Just allocate it up-front on entry to the block layer.  We're not
saving the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.

This patch prepares for further updates to the request allocation
path.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-core.c     |   42 +++++++++++++++---------------------------
 block/blk-throttle.c |    3 ---
 2 files changed, 15 insertions(+), 30 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 02b6cf8..242bf9e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -842,15 +842,11 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq;
 	struct request_list *rl = &q->rq;
-	struct elevator_type *et;
-	struct io_context *ioc;
+	struct elevator_type *et = q->elevator->type;
+	struct io_context *ioc = rq_ioc(bio);
 	struct io_cq *icq = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	bool retried = false;
 	int may_queue;
-retry:
-	et = q->elevator->type;
-	ioc = rq_ioc(bio);
 
 	if (unlikely(blk_queue_dead(q)))
 		return NULL;
@@ -862,20 +858,6 @@ retry:
 	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
 		if (rl->count[is_sync]+1 >= q->nr_requests) {
 			/*
-			 * We want ioc to record batching state.  If it's
-			 * not already there, creating a new one requires
-			 * dropping queue_lock, which in turn requires
-			 * retesting conditions to avoid queue hang.
-			 */
-			if (!ioc && !retried) {
-				spin_unlock_irq(q->queue_lock);
-				create_io_context(gfp_mask, q->node);
-				spin_lock_irq(q->queue_lock);
-				retried = true;
-				goto retry;
-			}
-
-			/*
 			 * The queue will fill after this allocation, so set
 			 * it as full, and mark this process as "batching".
 			 * This process will be allowed to complete a batch of
@@ -942,12 +924,8 @@ retry:
 	/* init elvpriv */
 	if (rw_flags & REQ_ELVPRIV) {
 		if (unlikely(et->icq_cache && !icq)) {
-			create_io_context(gfp_mask, q->node);
-			ioc = rq_ioc(bio);
-			if (!ioc)
-				goto fail_elvpriv;
-
-			icq = ioc_create_icq(ioc, q, gfp_mask);
+			if (ioc)
+				icq = ioc_create_icq(ioc, q, gfp_mask);
 			if (!icq)
 				goto fail_elvpriv;
 		}
@@ -1058,7 +1036,6 @@ retry:
 	 * to allocate at least one request, and up to a big batch of them
 	 * for a small period time.  See ioc_batching, ioc_set_batching
 	 */
-	create_io_context(GFP_NOIO, q->node);
 	ioc_set_batching(q, current->io_context);
 
 	spin_lock_irq(q->queue_lock);
@@ -1073,6 +1050,9 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 
 	BUG_ON(rw != READ && rw != WRITE);
 
+	/* create ioc upfront */
+	create_io_context(gfp_mask, q->node);
+
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, rw, NULL, gfp_mask);
 	if (!rq)
@@ -1684,6 +1664,14 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	/*
+	 * Various block parts want %current->io_context and lazy ioc
+	 * allocation ends up trading a lot of pain for a small amount of
+	 * memory.  Just allocate it upfront.  This may fail and block
+	 * layer knows how to live with it.
+	 */
+	create_io_context(GFP_ATOMIC, q->node);
+
 	if (blk_throtl_bio(q, bio))
 		return false;	/* throttled, will be resubmitted later */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e00e9c2..dce5132 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1123,9 +1123,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
-	/* bio_associate_current() needs ioc, try creating */
-	create_io_context(GFP_ATOMIC, q->node);
-
 	/*
 	 * A throtl_grp pointer retrieved under rcu can be used to access
 	 * basic fields like stats and io rates. If a group has no rules,
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 07/11] block: allocate io_context upfront
@ 2012-04-26 21:59     ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe
  Cc: vgoyal, ctalbott, rni, linux-kernel, cgroups, containers,
	fengguang.wu, hughd, akpm, Tejun Heo

The block layer allocates ioc very lazily.  It waits until the moment
the ioc is absolutely necessary; unfortunately, that moment can be
inside the queue lock, and __get_request() performs an unlock - try
alloc - retry dance.

Just allocate it up-front on entry to the block layer.  We're not
saving the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.

This patch prepares for further updates to the request allocation
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c     |   42 +++++++++++++++---------------------------
 block/blk-throttle.c |    3 ---
 2 files changed, 15 insertions(+), 30 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 02b6cf8..242bf9e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -842,15 +842,11 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq;
 	struct request_list *rl = &q->rq;
-	struct elevator_type *et;
-	struct io_context *ioc;
+	struct elevator_type *et = q->elevator->type;
+	struct io_context *ioc = rq_ioc(bio);
 	struct io_cq *icq = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	bool retried = false;
 	int may_queue;
-retry:
-	et = q->elevator->type;
-	ioc = rq_ioc(bio);
 
 	if (unlikely(blk_queue_dead(q)))
 		return NULL;
@@ -862,20 +858,6 @@ retry:
 	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
 		if (rl->count[is_sync]+1 >= q->nr_requests) {
 			/*
-			 * We want ioc to record batching state.  If it's
-			 * not already there, creating a new one requires
-			 * dropping queue_lock, which in turn requires
-			 * retesting conditions to avoid queue hang.
-			 */
-			if (!ioc && !retried) {
-				spin_unlock_irq(q->queue_lock);
-				create_io_context(gfp_mask, q->node);
-				spin_lock_irq(q->queue_lock);
-				retried = true;
-				goto retry;
-			}
-
-			/*
 			 * The queue will fill after this allocation, so set
 			 * it as full, and mark this process as "batching".
 			 * This process will be allowed to complete a batch of
@@ -942,12 +924,8 @@ retry:
 	/* init elvpriv */
 	if (rw_flags & REQ_ELVPRIV) {
 		if (unlikely(et->icq_cache && !icq)) {
-			create_io_context(gfp_mask, q->node);
-			ioc = rq_ioc(bio);
-			if (!ioc)
-				goto fail_elvpriv;
-
-			icq = ioc_create_icq(ioc, q, gfp_mask);
+			if (ioc)
+				icq = ioc_create_icq(ioc, q, gfp_mask);
 			if (!icq)
 				goto fail_elvpriv;
 		}
@@ -1058,7 +1036,6 @@ retry:
 	 * to allocate at least one request, and up to a big batch of them
 	 * for a small period time.  See ioc_batching, ioc_set_batching
 	 */
-	create_io_context(GFP_NOIO, q->node);
 	ioc_set_batching(q, current->io_context);
 
 	spin_lock_irq(q->queue_lock);
@@ -1073,6 +1050,9 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 
 	BUG_ON(rw != READ && rw != WRITE);
 
+	/* create ioc upfront */
+	create_io_context(gfp_mask, q->node);
+
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, rw, NULL, gfp_mask);
 	if (!rq)
@@ -1684,6 +1664,14 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	/*
+	 * Various block parts want %current->io_context and lazy ioc
+	 * allocation ends up trading a lot of pain for a small amount of
+	 * memory.  Just allocate it upfront.  This may fail and block
+	 * layer knows how to live with it.
+	 */
+	create_io_context(GFP_ATOMIC, q->node);
+
 	if (blk_throtl_bio(q, bio))
 		return false;	/* throttled, will be resubmitted later */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e00e9c2..dce5132 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1123,9 +1123,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
-	/* bio_associate_current() needs ioc, try creating */
-	create_io_context(GFP_ATOMIC, q->node);
-
 	/*
 	 * A throtl_grp pointer retrieved under rcu can be used to access
 	 * basic fields like stats and io rates. If a group has no rules,
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 08/11] blkcg: inline bio_blkcg() and friends
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Make bio_blkcg() and friends inline.  They are all very simple and
used in only a few places.

This patch is to prepare for further updates to request allocation
path.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c |   21 ---------------------
 block/blk-cgroup.h |   26 ++++++++++++++++++++++----
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a8f2f03..b5c155d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -31,27 +31,6 @@ EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
-struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup)
-{
-	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
-			    struct blkcg, css);
-}
-EXPORT_SYMBOL_GPL(cgroup_to_blkcg);
-
-static struct blkcg *task_blkcg(struct task_struct *tsk)
-{
-	return container_of(task_subsys_state(tsk, blkio_subsys_id),
-			    struct blkcg, css);
-}
-
-struct blkcg *bio_blkcg(struct bio *bio)
-{
-	if (bio && bio->bi_css)
-		return container_of(bio->bi_css, struct blkcg, css);
-	return task_blkcg(current);
-}
-EXPORT_SYMBOL_GPL(bio_blkcg);
-
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 8ac457c..e74cce1 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -120,8 +120,6 @@ struct blkcg_policy {
 
 extern struct blkcg blkcg_root;
 
-struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup);
-struct blkcg *bio_blkcg(struct bio *bio);
 struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 				    struct request_queue *q);
@@ -160,6 +158,25 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 void blkg_conf_finish(struct blkg_conf_ctx *ctx);
 
 
+static inline struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
+			    struct blkcg, css);
+}
+
+static inline struct blkcg *task_blkcg(struct task_struct *tsk)
+{
+	return container_of(task_subsys_state(tsk, blkio_subsys_id),
+			    struct blkcg, css);
+}
+
+static inline struct blkcg *bio_blkcg(struct bio *bio)
+{
+	if (bio && bio->bi_css)
+		return container_of(bio->bi_css, struct blkcg, css);
+	return task_blkcg(current);
+}
+
 /**
  * blkg_to_pdata - get policy private data
  * @blkg: blkg of interest
@@ -351,6 +368,7 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
 #else	/* CONFIG_BLK_CGROUP */
 
 struct cgroup;
+struct blkcg;
 
 struct blkg_policy_data {
 };
@@ -361,8 +379,6 @@ struct blkcg_gq {
 struct blkcg_policy {
 };
 
-static inline struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup) { return NULL; }
-static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
@@ -374,6 +390,8 @@ static inline int blkcg_activate_policy(struct request_queue *q,
 static inline void blkcg_deactivate_policy(struct request_queue *q,
 					   const struct blkcg_policy *pol) { }
 
+static inline struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup) { return NULL; }
+static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
 static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
 						  struct blkcg_policy *pol) { return NULL; }
 static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
-- 
1.7.7.3


* [PATCH 09/11] block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
       [not found] ` <1335477561-11131-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (7 preceding siblings ...)
  2012-04-26 21:59     ` Tejun Heo
@ 2012-04-26 21:59   ` Tejun Heo
  2012-04-26 21:59     ` Tejun Heo
  2012-04-26 21:59     ` Tejun Heo
  10 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
to use q->nr_rqs[] instead of q->rq.count[].

These counters separate queue-wide request statistics from the request
list and allow the implementation of per-queue request allocation.

While at it, properly indent fields of struct request_list.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-core.c       |   13 +++++++------
 include/linux/blkdev.h |   11 ++++++-----
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 242bf9e..7377eb6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -386,7 +386,7 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 		if (!list_empty(&q->queue_head) && q->request_fn)
 			__blk_run_queue(q);
 
-		drain |= q->rq.elvpriv;
+		drain |= q->nr_rqs_elvpriv;
 
 		/*
 		 * Unfortunately, requests are queued at and tracked from
@@ -396,7 +396,7 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 		if (drain_all) {
 			drain |= !list_empty(&q->queue_head);
 			for (i = 0; i < 2; i++) {
-				drain |= q->rq.count[i];
+				drain |= q->nr_rqs[i];
 				drain |= q->in_flight[i];
 				drain |= !list_empty(&q->flush_queue[i]);
 			}
@@ -513,7 +513,6 @@ static int blk_init_free_list(struct request_queue *q)
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
 	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
 
@@ -778,9 +777,10 @@ static void freed_request(struct request_queue *q, unsigned int flags)
 	struct request_list *rl = &q->rq;
 	int sync = rw_is_sync(flags);
 
+	q->nr_rqs[sync]--;
 	rl->count[sync]--;
 	if (flags & REQ_ELVPRIV)
-		rl->elvpriv--;
+		q->nr_rqs_elvpriv--;
 
 	__freed_request(q, sync);
 
@@ -889,6 +889,7 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
 		return NULL;
 
+	q->nr_rqs[is_sync]++;
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
@@ -904,7 +905,7 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 	 */
 	if (blk_rq_should_init_elevator(bio) && !blk_queue_bypass(q)) {
 		rw_flags |= REQ_ELVPRIV;
-		rl->elvpriv++;
+		q->nr_rqs_elvpriv++;
 		if (et->icq_cache && ioc)
 			icq = ioc_lookup_icq(ioc, q);
 	}
@@ -965,7 +966,7 @@ fail_elvpriv:
 	rq->elv.icq = NULL;
 
 	spin_lock_irq(q->queue_lock);
-	rl->elvpriv--;
+	q->nr_rqs_elvpriv--;
 	spin_unlock_irq(q->queue_lock);
 	goto out;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index af33fb1..b085be7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -50,11 +50,10 @@ struct request_list {
 	 * count[], starved[], and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
-	int count[2];
-	int starved[2];
-	int elvpriv;
-	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int			count[2];
+	int			starved[2];
+	mempool_t		*rq_pool;
+	wait_queue_head_t	wait[2];
 };
 
 /*
@@ -281,6 +280,8 @@ struct request_queue {
 	struct list_head	queue_head;
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
+	int			nr_rqs[2];	/* # allocated [a]sync rqs */
+	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
 	/*
 	 * the queue request freelist, one for reads and one for writes
-- 
1.7.7.3


* [PATCH 10/11] block: prepare for multiple request_lists
  2012-04-26 21:59 ` Tejun Heo
@ 2012-04-26 21:59     ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.

* Make queue full state per request_list.  blk_*queue_full() functions
  are renamed to blk_*rl_full() and take @rl instead of @q.

* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
  instead of @q.  Also add @gfp_mask parameter.

* Add blk_exit_rl() instead of destroying rl directly from
  blk_release_queue().

* Add request_list->q and make request alloc/free functions -
  blk_free_request(), [__]freed_request(), __get_request() - take @rl
  instead of @q.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-core.c       |   56 ++++++++++++++++++++++++++---------------------
 block/blk-sysfs.c      |   12 ++++------
 block/blk.h            |    3 ++
 include/linux/blkdev.h |   32 +++++++++++++++------------
 4 files changed, 57 insertions(+), 46 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7377eb6..38b6d3d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -504,13 +504,13 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+int blk_init_rl(struct request_list *rl, struct request_queue *q,
+		gfp_t gfp_mask)
 {
-	struct request_list *rl = &q->rq;
-
 	if (unlikely(rl->rq_pool))
 		return 0;
 
+	rl->q = q;
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
 	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
@@ -518,13 +518,19 @@ static int blk_init_free_list(struct request_queue *q)
 
 	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
 					  mempool_free_slab, request_cachep,
-					  GFP_KERNEL, q->node);
+					  gfp_mask, q->node);
 	if (!rl->rq_pool)
 		return -ENOMEM;
 
 	return 0;
 }
 
+void blk_exit_rl(struct request_list *rl)
+{
+	if (rl->rq_pool)
+		mempool_destroy(rl->rq_pool);
+}
+
 struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 {
 	return blk_alloc_queue_node(gfp_mask, -1);
@@ -666,7 +672,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 	if (!q)
 		return NULL;
 
-	if (blk_init_free_list(q))
+	if (blk_init_rl(&q->rq, q, GFP_KERNEL))
 		return NULL;
 
 	q->request_fn		= rfn;
@@ -708,15 +714,15 @@ bool blk_get_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_get_queue);
 
-static inline void blk_free_request(struct request_queue *q, struct request *rq)
+static inline void blk_free_request(struct request_list *rl, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV) {
-		elv_put_request(q, rq);
+		elv_put_request(rl->q, rq);
 		if (rq->elv.icq)
 			put_io_context(rq->elv.icq->ioc);
 	}
 
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, rl->rq_pool);
 }
 
 /*
@@ -753,9 +759,9 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_list *rl, int sync)
 {
-	struct request_list *rl = &q->rq;
+	struct request_queue *q = rl->q;
 
 	if (rl->count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
@@ -764,7 +770,7 @@ static void __freed_request(struct request_queue *q, int sync)
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
 
-		blk_clear_queue_full(q, sync);
+		blk_clear_rl_full(rl, sync);
 	}
 }
 
@@ -772,9 +778,9 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, unsigned int flags)
+static void freed_request(struct request_list *rl, unsigned int flags)
 {
-	struct request_list *rl = &q->rq;
+	struct request_queue *q = rl->q;
 	int sync = rw_is_sync(flags);
 
 	q->nr_rqs[sync]--;
@@ -782,10 +788,10 @@ static void freed_request(struct request_queue *q, unsigned int flags)
 	if (flags & REQ_ELVPRIV)
 		q->nr_rqs_elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(rl, sync);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(rl, sync ^ 1);
 }
 
 /*
@@ -825,7 +831,7 @@ static struct io_context *rq_ioc(struct bio *bio)
 
 /**
  * __get_request - get a free request
- * @q: request_queue to allocate request from
+ * @rl: request list to allocate from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
  * @gfp_mask: allocation mask
@@ -837,11 +843,11 @@ static struct io_context *rq_ioc(struct bio *bio)
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *__get_request(struct request_queue *q, int rw_flags,
+static struct request *__get_request(struct request_list *rl, int rw_flags,
 				     struct bio *bio, gfp_t gfp_mask)
 {
+	struct request_queue *q = rl->q;
 	struct request *rq;
-	struct request_list *rl = &q->rq;
 	struct elevator_type *et = q->elevator->type;
 	struct io_context *ioc = rq_ioc(bio);
 	struct io_cq *icq = NULL;
@@ -863,9 +869,9 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 			 * This process will be allowed to complete a batch of
 			 * requests, others will be blocked.
 			 */
-			if (!blk_queue_full(q, is_sync)) {
+			if (!blk_rl_full(rl, is_sync)) {
 				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
+				blk_set_rl_full(rl, is_sync);
 			} else {
 				if (may_queue != ELV_MQUEUE_MUST
 						&& !ioc_batching(q, ioc)) {
@@ -915,7 +921,7 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 	spin_unlock_irq(q->queue_lock);
 
 	/* allocate and init request */
-	rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	rq = mempool_alloc(rl->rq_pool, gfp_mask);
 	if (!rq)
 		goto fail_alloc;
 
@@ -979,7 +985,7 @@ fail_alloc:
 	 * queue, but this is pretty rare.
 	 */
 	spin_lock_irq(q->queue_lock);
-	freed_request(q, rw_flags);
+	freed_request(rl, rw_flags);
 
 	/*
 	 * in the very unlikely event that allocation failed and no
@@ -1016,7 +1022,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct request *rq;
 retry:
-	rq = __get_request(q, rw_flags, bio, gfp_mask);
+	rq = __get_request(&q->rq, rw_flags, bio, gfp_mask);
 	if (rq)
 		return rq;
 
@@ -1216,8 +1222,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
-		blk_free_request(q, req);
-		freed_request(q, flags);
+		blk_free_request(&q->rq, req);
+		freed_request(&q->rq, flags);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index aa41b47..234ce7c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -66,16 +66,16 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
 	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_SYNC);
+		blk_set_rl_full(rl, BLK_RW_SYNC);
 	} else {
-		blk_clear_queue_full(q, BLK_RW_SYNC);
+		blk_clear_rl_full(rl, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
 	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_ASYNC);
+		blk_set_rl_full(rl, BLK_RW_ASYNC);
 	} else {
-		blk_clear_queue_full(q, BLK_RW_ASYNC);
+		blk_clear_rl_full(rl, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
@@ -476,7 +476,6 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
@@ -489,8 +488,7 @@ static void blk_release_queue(struct kobject *kobj)
 		elevator_exit(q->elevator);
 	}
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	blk_exit_rl(&q->rq);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/blk.h b/block/blk.h
index 85f6ae4..a134231 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -18,6 +18,9 @@ static inline void __blk_get_queue(struct request_queue *q)
 	kobject_get(&q->kobj);
 }
 
+int blk_init_rl(struct request_list *rl, struct request_queue *q,
+		gfp_t gfp_mask);
+void blk_exit_rl(struct request_list *rl);
 void init_request_from_bio(struct request *req, struct bio *bio);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b085be7..92fc25f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -45,7 +45,12 @@ struct blkcg_gq;
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
+#define BLK_RL_SYNCFULL		(1U << 0)
+#define BLK_RL_ASYNCFULL	(1U << 1)
+
 struct request_list {
+	struct request_queue	*q;	/* the queue this rl belongs to */
+
 	/*
 	 * count[], starved[], and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
@@ -54,6 +59,7 @@ struct request_list {
 	int			starved[2];
 	mempool_t		*rq_pool;
 	wait_queue_head_t	wait[2];
+	unsigned int		flags;
 };
 
 /*
@@ -565,27 +571,25 @@ static inline bool rq_is_sync(struct request *rq)
 	return rw_is_sync(rq->cmd_flags);
 }
 
-static inline int blk_queue_full(struct request_queue *q, int sync)
+static inline bool blk_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		return test_bit(QUEUE_FLAG_SYNCFULL, &q->queue_flags);
-	return test_bit(QUEUE_FLAG_ASYNCFULL, &q->queue_flags);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	return rl->flags & flag;
 }
 
-static inline void blk_set_queue_full(struct request_queue *q, int sync)
+static inline void blk_set_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		queue_flag_set(QUEUE_FLAG_SYNCFULL, q);
-	else
-		queue_flag_set(QUEUE_FLAG_ASYNCFULL, q);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	rl->flags |= flag;
 }
 
-static inline void blk_clear_queue_full(struct request_queue *q, int sync)
+static inline void blk_clear_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		queue_flag_clear(QUEUE_FLAG_SYNCFULL, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	rl->flags &= ~flag;
 }
 
 
-- 
1.7.7.3


  */
-static void freed_request(struct request_queue *q, unsigned int flags)
+static void freed_request(struct request_list *rl, unsigned int flags)
 {
-	struct request_list *rl = &q->rq;
+	struct request_queue *q = rl->q;
 	int sync = rw_is_sync(flags);
 
 	q->nr_rqs[sync]--;
@@ -782,10 +788,10 @@ static void freed_request(struct request_queue *q, unsigned int flags)
 	if (flags & REQ_ELVPRIV)
 		q->nr_rqs_elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(rl, sync);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(rl, sync ^ 1);
 }
 
 /*
@@ -825,7 +831,7 @@ static struct io_context *rq_ioc(struct bio *bio)
 
 /**
  * __get_request - get a free request
- * @q: request_queue to allocate request from
+ * @rl: request list to allocate from
  * @rw_flags: RW and SYNC flags
  * @bio: bio to allocate request for (can be %NULL)
  * @gfp_mask: allocation mask
@@ -837,11 +843,11 @@ static struct io_context *rq_ioc(struct bio *bio)
  * Returns %NULL on failure, with @q->queue_lock held.
  * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
-static struct request *__get_request(struct request_queue *q, int rw_flags,
+static struct request *__get_request(struct request_list *rl, int rw_flags,
 				     struct bio *bio, gfp_t gfp_mask)
 {
+	struct request_queue *q = rl->q;
 	struct request *rq;
-	struct request_list *rl = &q->rq;
 	struct elevator_type *et = q->elevator->type;
 	struct io_context *ioc = rq_ioc(bio);
 	struct io_cq *icq = NULL;
@@ -863,9 +869,9 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 			 * This process will be allowed to complete a batch of
 			 * requests, others will be blocked.
 			 */
-			if (!blk_queue_full(q, is_sync)) {
+			if (!blk_rl_full(rl, is_sync)) {
 				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
+				blk_set_rl_full(rl, is_sync);
 			} else {
 				if (may_queue != ELV_MQUEUE_MUST
 						&& !ioc_batching(q, ioc)) {
@@ -915,7 +921,7 @@ static struct request *__get_request(struct request_queue *q, int rw_flags,
 	spin_unlock_irq(q->queue_lock);
 
 	/* allocate and init request */
-	rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	rq = mempool_alloc(rl->rq_pool, gfp_mask);
 	if (!rq)
 		goto fail_alloc;
 
@@ -979,7 +985,7 @@ fail_alloc:
 	 * queue, but this is pretty rare.
 	 */
 	spin_lock_irq(q->queue_lock);
-	freed_request(q, rw_flags);
+	freed_request(rl, rw_flags);
 
 	/*
 	 * in the very unlikely event that allocation failed and no
@@ -1016,7 +1022,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct request *rq;
 retry:
-	rq = __get_request(q, rw_flags, bio, gfp_mask);
+	rq = __get_request(&q->rq, rw_flags, bio, gfp_mask);
 	if (rq)
 		return rq;
 
@@ -1216,8 +1222,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
-		blk_free_request(q, req);
-		freed_request(q, flags);
+		blk_free_request(&q->rq, req);
+		freed_request(&q->rq, flags);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index aa41b47..234ce7c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -66,16 +66,16 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
 	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_SYNC);
+		blk_set_rl_full(rl, BLK_RW_SYNC);
 	} else {
-		blk_clear_queue_full(q, BLK_RW_SYNC);
+		blk_clear_rl_full(rl, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
 	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
-		blk_set_queue_full(q, BLK_RW_ASYNC);
+		blk_set_rl_full(rl, BLK_RW_ASYNC);
 	} else {
-		blk_clear_queue_full(q, BLK_RW_ASYNC);
+		blk_clear_rl_full(rl, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
@@ -476,7 +476,6 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
@@ -489,8 +488,7 @@ static void blk_release_queue(struct kobject *kobj)
 		elevator_exit(q->elevator);
 	}
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	blk_exit_rl(&q->rq);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/blk.h b/block/blk.h
index 85f6ae4..a134231 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -18,6 +18,9 @@ static inline void __blk_get_queue(struct request_queue *q)
 	kobject_get(&q->kobj);
 }
 
+int blk_init_rl(struct request_list *rl, struct request_queue *q,
+		gfp_t gfp_mask);
+void blk_exit_rl(struct request_list *rl);
 void init_request_from_bio(struct request *req, struct bio *bio);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b085be7..92fc25f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -45,7 +45,12 @@ struct blkcg_gq;
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
+#define BLK_RL_SYNCFULL		(1U << 0)
+#define BLK_RL_ASYNCFULL	(1U << 1)
+
 struct request_list {
+	struct request_queue	*q;	/* the queue this rl belongs to */
+
 	/*
 	 * count[], starved[], and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
@@ -54,6 +59,7 @@ struct request_list {
 	int			starved[2];
 	mempool_t		*rq_pool;
 	wait_queue_head_t	wait[2];
+	unsigned int		flags;
 };
 
 /*
@@ -565,27 +571,25 @@ static inline bool rq_is_sync(struct request *rq)
 	return rw_is_sync(rq->cmd_flags);
 }
 
-static inline int blk_queue_full(struct request_queue *q, int sync)
+static inline bool blk_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		return test_bit(QUEUE_FLAG_SYNCFULL, &q->queue_flags);
-	return test_bit(QUEUE_FLAG_ASYNCFULL, &q->queue_flags);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	return rl->flags & flag;
 }
 
-static inline void blk_set_queue_full(struct request_queue *q, int sync)
+static inline void blk_set_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		queue_flag_set(QUEUE_FLAG_SYNCFULL, q);
-	else
-		queue_flag_set(QUEUE_FLAG_ASYNCFULL, q);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	rl->flags |= flag;
 }
 
-static inline void blk_clear_queue_full(struct request_queue *q, int sync)
+static inline void blk_clear_rl_full(struct request_list *rl, bool sync)
 {
-	if (sync)
-		queue_flag_clear(QUEUE_FLAG_SYNCFULL, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
+	unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
+
+	rl->flags &= ~flag;
 }
 
 
-- 
1.7.7.3



* [PATCH 11/11] blkcg: implement per-blkg request allocation
From: Tejun Heo @ 2012-04-26 21:59 UTC (permalink / raw)
  To: axboe
  Cc: vgoyal, ctalbott, rni, linux-kernel, cgroups, containers,
	fengguang.wu, hughd, akpm, Tejun Heo

Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued.  When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.

This can easily be demonstrated by creating a blkio cgroup w/ a very
low weight, putting a program which issues a lot of random direct IOs
in it, and running a sequential IO from a different cgroup.  As soon
as the request pool is used up, the sequential IO bandwidth crashes.

This patch implements per-blkg request_lists.  Each blkg has its own
request_list and any IO allocates its request from the matching blkg,
making blkcgs completely isolated in terms of request allocation.

* Root blkcg uses the request_list embedded in each request_queue,
  which was renamed to @q->root_rl from @q->rq.  While this makes
  blkcg rl handling a bit hairier, it avoids most of the overhead for
  the root blkcg.

* Queue fullness is properly per request_list but bdi isn't blkcg
  aware yet, so congestion state currently just follows the root
  blkcg.  As writeback isn't aware of blkcg yet, this works okay for
  async congestion but readahead may get the wrong signals.  It's
  better than blkcg completely collapsing with shared request_list but
  needs to be improved with future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-cgroup.c     |   51 ++++++++++++++++++++++++--
 block/blk-cgroup.h     |   95 ++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-core.c       |   33 +++++++++++++----
 block/blk-sysfs.c      |   32 ++++++++++-------
 include/linux/blkdev.h |   12 +++++--
 5 files changed, 195 insertions(+), 28 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b5c155d..c7761cb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -63,6 +63,7 @@ static void blkg_free(struct blkcg_gq *blkg)
 		kfree(pd);
 	}
 
+	blk_exit_rl(&blkg->rl);
 	kfree(blkg);
 }
 
@@ -90,6 +91,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 
+	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
+	if (blkcg != &blkcg_root) {
+		if (blk_init_rl(&blkg->rl, q, gfp_mask))
+			goto err_free;
+		blkg->rl.blkg = blkg;
+	}
+
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkg_policy_data *pd;
@@ -99,10 +107,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 
 		/* alloc per-policy data and attach it to blkg */
 		pd = kzalloc_node(pol->pd_size, gfp_mask, q->node);
-		if (!pd) {
-			blkg_free(blkg);
-			return NULL;
-		}
+		if (!pd)
+			goto err_free;
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
@@ -113,6 +119,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	}
 
 	return blkg;
+
+err_free:
+	blkg_free(blkg);
+	return NULL;
 }
 
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
@@ -305,6 +315,38 @@ void __blkg_release(struct blkcg_gq *blkg)
 }
 EXPORT_SYMBOL_GPL(__blkg_release);
 
+/*
+ * The next function used by blk_queue_for_each_rl().  It's a bit tricky
+ * because the root blkg uses @q->root_rl instead of its own rl.
+ */
+struct request_list *__blk_queue_next_rl(struct request_list *rl,
+					 struct request_queue *q)
+{
+	struct list_head *ent;
+	struct blkcg_gq *blkg;
+
+	/*
+	 * Determine the current blkg list_head.  The first entry is
+	 * root_rl which is off @q->blkg_list and mapped to the head.
+	 */
+	if (rl == &q->root_rl) {
+		ent = &q->blkg_list;
+	} else {
+		blkg = container_of(rl, struct blkcg_gq, rl);
+		ent = &blkg->q_node;
+	}
+
+	/* walk to the next list_head, skip root blkcg */
+	ent = ent->next;
+	if (ent == &q->root_blkg->q_node)
+		ent = ent->next;
+	if (ent == &q->blkg_list)
+		return NULL;
+
+	blkg = container_of(ent, struct blkcg_gq, q_node);
+	return &blkg->rl;
+}
+
 static int blkcg_reset_stats(struct cgroup *cgroup, struct cftype *cftype,
 			     u64 val)
 {
@@ -755,6 +797,7 @@ int blkcg_activate_policy(struct request_queue *q,
 		goto out_unlock;
 	}
 	q->root_blkg = blkg;
+	q->root_rl.blkg = blkg;
 
 	list_for_each_entry(blkg, &q->blkg_list, q_node)
 		cnt++;
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index e74cce1..93da70b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -17,6 +17,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/seq_file.h>
 #include <linux/radix-tree.h>
+#include <linux/blkdev.h>
 
 /* Max limits for throttle policy */
 #define THROTL_IOPS_MAX		UINT_MAX
@@ -93,6 +94,8 @@ struct blkcg_gq {
 	struct list_head		q_node;
 	struct hlist_node		blkcg_node;
 	struct blkcg			*blkcg;
+	/* request allocation list for this blkcg-q pair */
+	struct request_list		rl;
 	/* reference count */
 	int				refcnt;
 
@@ -251,6 +254,88 @@ static inline void blkg_put(struct blkcg_gq *blkg)
 }
 
 /**
+ * blk_get_rl - get request_list to use
+ * @q: request_queue of interest
+ * @bio: bio which will be attached to the allocated request (may be %NULL)
+ *
+ * The caller wants to allocate a request from @q to use for @bio.  Find
+ * the request_list to use and obtain a reference on it.  Should be called
+ * under queue_lock.  This function is guaranteed to return non-%NULL
+ * request_list.
+ */
+static inline struct request_list *blk_get_rl(struct request_queue *q,
+					      struct bio *bio)
+{
+	struct blkcg *blkcg;
+	struct blkcg_gq *blkg;
+
+	rcu_read_lock();
+
+	blkcg = bio_blkcg(bio);
+
+	/* bypass blkg lookup and use @q->root_rl directly for root */
+	if (blkcg == &blkcg_root) {
+		rcu_read_unlock();
+		return &q->root_rl;
+	}
+
+	blkg = blkg_lookup_create(blkcg, q);
+	blkg_get(blkg);
+
+	rcu_read_unlock();
+
+	return &blkg->rl;
+}
+
+/**
+ * blk_put_rl - put request_list
+ * @rl: request_list to put
+ *
+ * Put the reference acquired by blk_get_rl().  Should be called under
+ * queue_lock.
+ */
+static inline void blk_put_rl(struct request_list *rl)
+{
+	/* root_rl may not have blkg set */
+	if (rl->blkg && rl->blkg->blkcg != &blkcg_root)
+		blkg_put(rl->blkg);
+}
+
+/**
+ * blk_rq_set_rl - associate a request with a request_list
+ * @rq: request of interest
+ * @rl: target request_list
+ *
+ * Associate @rq with @rl so that accounting and freeing can know the
+ * request_list @rq came from.
+ */
+static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
+{
+	rq->rl = rl;
+}
+
+/**
+ * blk_rq_rl - return the request_list a request came from
+ * @rq: request of interest
+ *
+ * Return the request_list @rq is allocated from.
+ */
+static inline struct request_list *blk_rq_rl(struct request *rq)
+{
+	return rq->rl;
+}
+
+struct request_list *__blk_queue_next_rl(struct request_list *rl,
+					 struct request_queue *q);
+/**
+ * blk_queue_for_each_rl - iterate through all request_lists of a request_queue
+ *
+ * Should be used under queue_lock.
+ */
+#define blk_queue_for_each_rl(rl, q)	\
+	for ((rl) = &(q)->root_rl; (rl); (rl) = __blk_queue_next_rl((rl), (q)))
+
+/**
  * blkg_stat_add - add a value to a blkg_stat
  * @stat: target blkg_stat
  * @val: value to add
@@ -392,6 +477,7 @@ static inline void blkcg_deactivate_policy(struct request_queue *q,
 
 static inline struct blkcg *cgroup_to_blkcg(struct cgroup *cgroup) { return NULL; }
 static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
+
 static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
 						  struct blkcg_policy *pol) { return NULL; }
 static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
@@ -399,5 +485,14 @@ static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 
+static inline struct request_list *blk_get_rl(struct request_queue *q,
+					      struct bio *bio) { return &q->root_rl; }
+static inline void blk_put_rl(struct request_list *rl) { }
+static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { }
+static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
+
+#define blk_queue_for_each_rl(rl, q)	\
+	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
+
 #endif	/* CONFIG_BLK_CGROUP */
 #endif	/* _BLK_CGROUP_H */
diff --git a/block/blk-core.c b/block/blk-core.c
index 38b6d3d..5c4f4a1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -672,7 +672,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 	if (!q)
 		return NULL;
 
-	if (blk_init_rl(&q->rq, q, GFP_KERNEL))
+	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
 		return NULL;
 
 	q->request_fn		= rfn;
@@ -763,7 +763,12 @@ static void __freed_request(struct request_list *rl, int sync)
 {
 	struct request_queue *q = rl->q;
 
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	/*
+	 * bdi isn't aware of blkcg yet.  As all async IOs end up root
+	 * blkcg anyway, just use root blkcg state.
+	 */
+	if (rl == &q->root_rl &&
+	    rl->count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
 	if (rl->count[sync] + 1 <= q->nr_requests) {
@@ -884,7 +889,12 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
 				}
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
+		/*
+		 * bdi isn't aware of blkcg yet.  As all async IOs end up
+		 * root blkcg anyway, just use root blkcg state.
+		 */
+		if (rl == &q->root_rl)
+			blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -926,6 +936,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
 		goto fail_alloc;
 
 	blk_rq_init(q, rq);
+	blk_rq_set_rl(rq, rl);
 	rq->cmd_flags = rw_flags | REQ_ALLOCED;
 
 	/* init elvpriv */
@@ -1019,15 +1030,19 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	DEFINE_WAIT(wait);
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	struct request *rq;
+
+	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
-	rq = __get_request(&q->rq, rw_flags, bio, gfp_mask);
+	rq = __get_request(rl, rw_flags, bio, gfp_mask);
 	if (rq)
 		return rq;
 
-	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q)))
+	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q))) {
+		blk_put_rl(rl);
 		return NULL;
+	}
 
 	/* wait on @rl and retry */
 	prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
@@ -1218,12 +1233,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	 */
 	if (req->cmd_flags & REQ_ALLOCED) {
 		unsigned int flags = req->cmd_flags;
+		struct request_list *rl = blk_rq_rl(req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
-		blk_free_request(&q->rq, req);
-		freed_request(&q->rq, flags);
+		blk_free_request(rl, req);
+		freed_request(rl, flags);
+		blk_put_rl(rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 234ce7c..9628b29 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -40,7 +40,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret;
 
@@ -55,6 +55,9 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
+	/* congestion isn't cgroup aware and follows root blkcg for now */
+	rl = &q->root_rl;
+
 	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
 	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
@@ -65,19 +68,22 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
-		blk_set_rl_full(rl, BLK_RW_SYNC);
-	} else {
-		blk_clear_rl_full(rl, BLK_RW_SYNC);
-		wake_up(&rl->wait[BLK_RW_SYNC]);
+	blk_queue_for_each_rl(rl, q) {
+		if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+			blk_set_rl_full(rl, BLK_RW_SYNC);
+		} else {
+			blk_clear_rl_full(rl, BLK_RW_SYNC);
+			wake_up(&rl->wait[BLK_RW_SYNC]);
+		}
+
+		if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+			blk_set_rl_full(rl, BLK_RW_ASYNC);
+		} else {
+			blk_clear_rl_full(rl, BLK_RW_ASYNC);
+			wake_up(&rl->wait[BLK_RW_ASYNC]);
+		}
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
-		blk_set_rl_full(rl, BLK_RW_ASYNC);
-	} else {
-		blk_clear_rl_full(rl, BLK_RW_ASYNC);
-		wake_up(&rl->wait[BLK_RW_ASYNC]);
-	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
@@ -488,7 +494,7 @@ static void blk_release_queue(struct kobject *kobj)
 		elevator_exit(q->elevator);
 	}
 
-	blk_exit_rl(&q->rq);
+	blk_exit_rl(&q->root_rl);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92fc25f..a989e9b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -50,7 +50,9 @@ typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	struct request_queue	*q;	/* the queue this rl belongs to */
-
+#ifdef CONFIG_BLK_CGROUP
+	struct blkcg_gq		*blkg;	/* blkg this request pool belongs to */
+#endif
 	/*
 	 * count[], starved[], and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
@@ -142,6 +144,7 @@ struct request {
 	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
+	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
 	unsigned long long io_start_time_ns;    /* when passed to hardware */
 #endif
@@ -290,9 +293,12 @@ struct request_queue {
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
 	/*
-	 * the queue request freelist, one for reads and one for writes
+	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
+	 * is used, root blkg allocates from @q->root_rl and all other
+	 * blkgs from their own blkg->rl.  Which one to use should be
+	 * determined using bio_request_list().
 	 */
-	struct request_list	rq;
+	struct request_list	root_rl;
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
-- 
1.7.7.3

+		 * root blkcg anyway, just use root blkcg state.
+		 */
+		if (rl == &q->root_rl)
+			blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -926,6 +936,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
 		goto fail_alloc;
 
 	blk_rq_init(q, rq);
+	blk_rq_set_rl(rq, rl);
 	rq->cmd_flags = rw_flags | REQ_ALLOCED;
 
 	/* init elvpriv */
@@ -1019,15 +1030,19 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	DEFINE_WAIT(wait);
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	struct request *rq;
+
+	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
-	rq = __get_request(&q->rq, rw_flags, bio, gfp_mask);
+	rq = __get_request(rl, rw_flags, bio, gfp_mask);
 	if (rq)
 		return rq;
 
-	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q)))
+	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dead(q))) {
+		blk_put_rl(rl);
 		return NULL;
+	}
 
 	/* wait on @rl and retry */
 	prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
@@ -1218,12 +1233,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	 */
 	if (req->cmd_flags & REQ_ALLOCED) {
 		unsigned int flags = req->cmd_flags;
+		struct request_list *rl = blk_rq_rl(req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
-		blk_free_request(&q->rq, req);
-		freed_request(&q->rq, flags);
+		blk_free_request(rl, req);
+		freed_request(rl, flags);
+		blk_put_rl(rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 234ce7c..9628b29 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -40,7 +40,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret;
 
@@ -55,6 +55,9 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
+	/* congestion isn't cgroup aware and follows root blkcg for now */
+	rl = &q->root_rl;
+
 	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
 	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
@@ -65,19 +68,22 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
-		blk_set_rl_full(rl, BLK_RW_SYNC);
-	} else {
-		blk_clear_rl_full(rl, BLK_RW_SYNC);
-		wake_up(&rl->wait[BLK_RW_SYNC]);
+	blk_queue_for_each_rl(rl, q) {
+		if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+			blk_set_rl_full(rl, BLK_RW_SYNC);
+		} else {
+			blk_clear_rl_full(rl, BLK_RW_SYNC);
+			wake_up(&rl->wait[BLK_RW_SYNC]);
+		}
+
+		if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+			blk_set_rl_full(rl, BLK_RW_ASYNC);
+		} else {
+			blk_clear_rl_full(rl, BLK_RW_ASYNC);
+			wake_up(&rl->wait[BLK_RW_ASYNC]);
+		}
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
-		blk_set_rl_full(rl, BLK_RW_ASYNC);
-	} else {
-		blk_clear_rl_full(rl, BLK_RW_ASYNC);
-		wake_up(&rl->wait[BLK_RW_ASYNC]);
-	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
@@ -488,7 +494,7 @@ static void blk_release_queue(struct kobject *kobj)
 		elevator_exit(q->elevator);
 	}
 
-	blk_exit_rl(&q->rq);
+	blk_exit_rl(&q->root_rl);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92fc25f..a989e9b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -50,7 +50,9 @@ typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	struct request_queue	*q;	/* the queue this rl belongs to */
-
+#ifdef CONFIG_BLK_CGROUP
+	struct blkcg_gq		*blkg;	/* blkg this request pool belongs to */
+#endif
 	/*
 	 * count[], starved[], and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
@@ -142,6 +144,7 @@ struct request {
 	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
+	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
 	unsigned long long io_start_time_ns;    /* when passed to hardware */
 #endif
@@ -290,9 +293,12 @@ struct request_queue {
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
 	/*
-	 * the queue request freelist, one for reads and one for writes
+	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
+	 * is used, root blkg allocates from @q->root_rl and all other
+	 * blkgs from their own blkg->rl.  Which one to use should be
+	 * determined using bio_request_list().
 	 */
-	struct request_list	rq;
+	struct request_list	root_rl;
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/11] blkcg: fix blkg_alloc() failure path
       [not found]     ` <1335477561-11131-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2012-04-27 14:26       ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 14:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Thu, Apr 26, 2012 at 02:59:11PM -0700, Tejun Heo wrote:
> When policy data allocation fails in the middle, blkg_alloc() invokes
> blkg_free() to destroy the half constructed blkg.  This ends up
> calling pd_exit_fn() on policy datas which didn't go through
> pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
> immediately after each policy data allocation.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  block/blk-cgroup.c |    6 +-----
>  1 files changed, 1 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 02cf633..4ab7420 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -125,12 +125,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
>  
>  		blkg->pd[i] = pd;
>  		pd->blkg = blkg;
> -	}
> -
> -	/* invoke per-policy init */
> -	for (i = 0; i < BLKCG_MAX_POLS; i++) {
> -		struct blkcg_policy *pol = blkcg_policy[i];
>  
> +		/* invoke per-policy init */
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			pol->pd_init_fn(blkg);

Deja Vu. In one of the earlier mails I had asked about moving init_fn
into the upper loop and getting rid of the for loop below. Then I
retracted it, saying you probably wanted to allocate all the groups
first before calling the init functions of the individual policies.
Here we are back again, for a different reason though. :-)

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek
>  	}
> -- 
> 1.7.7.3

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/11] blkcg: fix blkg_alloc() failure path
       [not found]       ` <20120427142652.GH10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 14:27         ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 14:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 10:26:52AM -0400, Vivek Goyal wrote:
> On Thu, Apr 26, 2012 at 02:59:11PM -0700, Tejun Heo wrote:
> > When policy data allocation fails in the middle, blkg_alloc() invokes
> > blkg_free() to destroy the half constructed blkg.  This ends up
> > calling pd_exit_fn() on policy datas which didn't go through
> > pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
> > immediately after each policy data allocation.
> > 
> > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  block/blk-cgroup.c |    6 +-----
> >  1 files changed, 1 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 02cf633..4ab7420 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -125,12 +125,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
> >  
> >  		blkg->pd[i] = pd;
> >  		pd->blkg = blkg;
> > -	}
> > -
> > -	/* invoke per-policy init */
> > -	for (i = 0; i < BLKCG_MAX_POLS; i++) {
> > -		struct blkcg_policy *pol = blkcg_policy[i];
> >  
> > +		/* invoke per-policy init */
> >  		if (blkcg_policy_enabled(blkg->q, pol))
> >  			pol->pd_init_fn(blkg);
> 
> Deja Vu. In one of the mails I had said that how about moving init_fn
> in upper loop and get rid of for loop below. Then retracted it saying
> probably you wanted to allocate all the groups first before calling 
> init functions of individual policies. Here we are back again for a
> different reason though. :-)

Heh, yeah, should have updated it then. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/11] blkcg: __blkg_lookup_create() doesn't have to fail on radix tree preload failure
  2012-04-26 21:59     ` Tejun Heo
@ 2012-04-27 14:42         ` Vivek Goyal
  -1 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 14:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Thu, Apr 26, 2012 at 02:59:12PM -0700, Tejun Heo wrote:
> __blkg_lookup_create() currently fails if radix_tree_preload() fails;
> however, preload failure doesn't imply insertion failure.  Don't fail
> __blkg_lookup_create() on preload failure.
> 

Hi Tejun,

If we are going to try the insertion anyway, irrespective of whether
preload succeeded or not, then why call radix_tree_preload()
at all? How does that help?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/11] blkcg: __blkg_lookup_create() doesn't have to fail on radix tree preload failure
       [not found]         ` <20120427144258.GI10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 14:47           ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 14:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

Hello,

On Fri, Apr 27, 2012 at 10:42:58AM -0400, Vivek Goyal wrote:
> On Thu, Apr 26, 2012 at 02:59:12PM -0700, Tejun Heo wrote:
> > __blkg_lookup_create() currently fails if radix_tree_preload() fails;
> > however, preload failure doesn't imply insertion failure.  Don't fail
> > __blkg_lookup_create() on preload failure.
> > 
> 
> If we are going to try the insertion anyway, irrespective of whether
> preload succeeded or not, then why call radix_tree_preload()
> at all? How does that help?

Hmmm... it seems I originally misread radix_tree_node_alloc() - I
thought it didn't go through kmem_cache_alloc() if gfp_mask didn't
contain __GFP_WAIT.  If we don't use a more permissive GFP flag during
preloading there's no point in preloading.  Will drop it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread
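For context, the point of radix_tree_preload() is exactly the pattern below: preallocate tree nodes with a permissive GFP mask while sleeping is still allowed, then perform the insertion atomically under the lock, with the insertion falling back on the preloaded per-CPU nodes if its own restricted allocation fails. If the insertion path ends up using the same restrictive mask anyway, the preload buys nothing — which is the conclusion above. A sketch of the canonical kernel-side pattern (illustrative, not compilable outside the kernel):

```c
/* canonical preload pattern (kernel-side sketch) */
err = radix_tree_preload(GFP_KERNEL);	/* may sleep; on success fills the
					 * per-cpu node pool and returns with
					 * preemption disabled */

spin_lock_irq(&lock);			/* atomic context from here on */
ret = radix_tree_insert(&root, index, item);
					/* node allocation can dip into the
					 * preloaded pool when the atomic
					 * allocation fails */
spin_unlock_irq(&lock);

if (!err)
	radix_tree_preload_end();	/* re-enable preemption */
```

Note that radix_tree_preload_end() must only be called when the preload succeeded, since a failed preload returns with preemption still enabled.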

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-26 21:59     ` Tejun Heo
@ 2012-04-27 14:54         ` Jeff Moyer
  -1 siblings, 0 replies; 77+ messages in thread
From: Jeff Moyer @ 2012-04-27 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> This patch implements per-blkg request_list.  Each blkg has its own
> request_list and any IO allocates its request from the matching blkg
> making blkcgs completely isolated in terms of request allocation.

So, nr_requests is now actually nr_requests * # of blk cgroups.  Is that
right?  Are you at all concerned about the amount of memory that can be
tied up as the number of cgroups increases?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]         ` <x49wr51usxi.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
@ 2012-04-27 15:02           ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 15:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Hello,

On Fri, Apr 27, 2012 at 10:54:01AM -0400, Jeff Moyer wrote:
> > This patch implements per-blkg request_list.  Each blkg has its own
> > request_list and any IO allocates its request from the matching blkg
> > making blkcgs completely isolated in terms of request allocation.
> 
> So, nr_requests is now actually nr_requests * # of blk cgroups.  Is that
> right?  Are you at all concerned about the amount of memory that can be
> tied up as the number of cgroups increases?

Yeah, I thought about it and I don't think there's a single good
solution here.  The other extreme would be splitting nr_requests by
the number of cgroups but that seems even worse - each cgroup should
be able to hit maximum throughput.  Given that a lot of workloads tend
to regulate themselves before hitting nr_requests, I think it's best
to leave it as-is and treat each cgroup as having a separate channel for
now.  It's a configurable parameter after all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 15:02           ` Tejun Heo
@ 2012-04-27 15:40               ` Vivek Goyal
  -1 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 15:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 08:02:17AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Fri, Apr 27, 2012 at 10:54:01AM -0400, Jeff Moyer wrote:
> > > This patch implements per-blkg request_list.  Each blkg has its own
> > > request_list and any IO allocates its request from the matching blkg
> > > making blkcgs completely isolated in terms of request allocation.
> > 
> > So, nr_requests is now actually nr_requests * # of blk cgroups.  Is that
> > right?  Are you at all concerned about the amount of memory that can be
> > tied up as the number of cgroups increases?
> 
> Yeah, I thought about it and I don't think there's a single good
> solution here.  The other extreme would be splitting nr_requests by
> the number of cgroups but that seems even worse - each cgroup should
> be able to hit maximum throughput.  Given that a lot of workloads tend
> to regulate themselves before hitting nr_requests, I think it's best
> to leave it as-is and treat each cgroup as having a separate channel for
> now.  It's a configurable parameter after all.

So on a slow device a malicious application can easily create thousands
of groups, queue up tons of IO, and tie up unreclaimable memory?
Sounds a little scary.
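
The amount of memory at stake can be estimated with a rough worst-case calculation. Both constants below are assumptions for illustration (a per-request footprint of 512 bytes and the common nr_requests default of 128), not measured kernel values.

```python
# Rough worst case for memory tied up in requests across many cgroups,
# if every cgroup fills its private pool with in-flight requests.
# All constants are illustrative assumptions, not measured kernel values.
NR_REQUESTS = 128    # assumed per-blkg pool depth (queue/nr_requests)
REQUEST_SIZE = 512   # assumed bytes pinned per allocated request

def worst_case_bytes(n_cgroups):
    """Upper bound on bytes pinned when every cgroup's pool is full."""
    return n_cgroups * NR_REQUESTS * REQUEST_SIZE

mib = worst_case_bytes(10_000) / (1 << 20)
print(f"10,000 cgroups -> {mib:.0f} MiB pinned")  # 625 MiB with these assumptions
```

Under these assumptions, ten thousand groups on one slow queue could pin on the order of hundreds of MiB, which is the concern raised here.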

I had used two separate limits: a per-queue limit and a per-group limit
(nr_requests and nr_group_requests). That made the implementation
complex and relied on the user doing the right configuration so that one
cgroup does not get serialized behind another once we hit nr_requests.
I am not advocating that solution, as it was not very nice either.

Hmm.., tricky...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]               ` <20120427154033.GJ10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 15:45                 ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 15:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 11:40:34AM -0400, Vivek Goyal wrote:
> On Fri, Apr 27, 2012 at 08:02:17AM -0700, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Apr 27, 2012 at 10:54:01AM -0400, Jeff Moyer wrote:
> > > > This patch implements per-blkg request_list.  Each blkg has its own
> > > > request_list and any IO allocates its request from the matching blkg
> > > > making blkcgs completely isolated in terms of request allocation.
> > > 
> > > So, nr_requests is now actually nr_requests * # of blk cgroups.  Is that
> > > right?  Are you at all concerned about the amount of memory that can be
> > > tied up as the number of cgroups increases?
> > 
> > Yeah, I thought about it and I don't think there's a single good
> > solution here.  The other extreme would be splitting nr_requests by
> > the number of cgroups but that seems even worse - each cgroup should
> > be able to hit maximum throughput.  Given that a lot of workloads tend
> > to regulate themselves before hitting nr_requests, I think it's best
> > to leave it as-is and treat each cgroup as having a separate channel for
> > now.  It's a configurable parameter after all.
> 
> So on a slow device a malicious application can easily create thousands
> of groups, queue up tons of IO, and tie up unreclaimable memory?
> Sounds a little scary.

A malicious application may just jack up nr_requests.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]                 ` <20120427154502.GM27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 15:48                   ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 08:45:02AM -0700, Tejun Heo wrote:
> On Fri, Apr 27, 2012 at 11:40:34AM -0400, Vivek Goyal wrote:
> > On Fri, Apr 27, 2012 at 08:02:17AM -0700, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Fri, Apr 27, 2012 at 10:54:01AM -0400, Jeff Moyer wrote:
> > > > > This patch implements per-blkg request_list.  Each blkg has its own
> > > > > request_list and any IO allocates its request from the matching blkg
> > > > > making blkcgs completely isolated in terms of request allocation.
> > > > 
> > > > So, nr_requests is now actually nr_requests * # of blk cgroups.  Is that
> > > > right?  Are you at all concerned about the amount of memory that can be
> > > > tied up as the number of cgroups increases?
> > > 
> > > Yeah, I thought about it and I don't think there's a single good
> > > solution here.  The other extreme would be splitting nr_requests by
> > > the number of cgroups but that seems even worse - each cgroup should
> > > be able to hit maximum throughput.  Given that a lot of workloads tend
> > > to regulate themselves before hitting nr_requests, I think it's best
> > > to leave it as-is and treat each cgroup as having a separate channel for
> > > now.  It's a configurable parameter after all.
> > 
> > So on a slow device a malicious application can easily create thousands
> > of groups, queue up tons of IO, and tie up unreclaimable memory?
> > Sounds a little scary.
> 
> A malicious application may just jack up nr_requests.

Not an unprivileged malicious application. In a typical cgroup scenario, we
can allow unprivileged users to create child cgroups so that they can
further subdivide their resources among their child groups (e.g. put Firefox
in one cgroup, OpenOffice in another, etc.).

So it is not the same as jacking up nr_requests.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 15:48                   ` Vivek Goyal
@ 2012-04-27 15:51                       ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 15:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 11:48:41AM -0400, Vivek Goyal wrote:
> Not an unprivileged malicious application. In a typical cgroup scenario, we
> can allow unprivileged users to create child cgroups so that they can
> further subdivide their resources among their child groups (e.g. put Firefox
> in one cgroup, OpenOffice in another, etc.).
> 
> So it is not the same as jacking up nr_requests.

I find allowing unpriv users to create cgroups dumb.  A cgroup consumes
kernel memory.  Without kmemcg, what prevents them from creating a
gazillion cgroups and consuming all memory?  The idea of exposing
cgroups to !priv users is just broken from the get-go.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 15:51                       ` Tejun Heo
@ 2012-04-27 15:56                           ` Vivek Goyal
  -1 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 15:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 08:51:40AM -0700, Tejun Heo wrote:
> On Fri, Apr 27, 2012 at 11:48:41AM -0400, Vivek Goyal wrote:
> > Not an unprivileged malicious application. In a typical cgroup scenario, we
> > can allow unprivileged users to create child cgroups so that they can
> > further subdivide their resources among their child groups (e.g. put Firefox
> > in one cgroup, OpenOffice in another, etc.).
> > 
> > So it is not the same as jacking up nr_requests.
> 
> I find allowing unpriv users to create cgroups dumb.  A cgroup consumes
> kernel memory.  Without kmemcg, what prevents them from creating a
> gazillion cgroups and consuming all memory?  The idea of exposing
> cgroups to !priv users is just broken from the get-go.

Well, creating a task consumes memory too, but we allow unpriv users to
create tasks. :-)

Maybe a system-wide cgroup limit would make sense?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 15:56                           ` Vivek Goyal
@ 2012-04-27 16:19                               ` Vivek Goyal
  -1 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 16:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 11:56:12AM -0400, Vivek Goyal wrote:
> On Fri, Apr 27, 2012 at 08:51:40AM -0700, Tejun Heo wrote:
> > On Fri, Apr 27, 2012 at 11:48:41AM -0400, Vivek Goyal wrote:
> > > Not an unprivileged malicious application. In a typical cgroup scenario, we
> > > can allow unprivileged users to create child cgroups so that they can
> > > further subdivide their resources among their child groups (e.g. put Firefox
> > > in one cgroup, OpenOffice in another, etc.).
> > > 
> > > So it is not the same as jacking up nr_requests.
> > 
> > I find allowing unpriv users to create cgroups dumb.  A cgroup consumes
> > kernel memory.  Without kmemcg, what prevents them from creating a
> > gazillion cgroups and consuming all memory?  The idea of exposing
> > cgroups to !priv users is just broken from the get-go.
> 
> Well, creating a task consumes memory too, but we allow unpriv users to
> create tasks. :-)

Well, the kernel can kill tasks and reclaim that memory, so this is not an
appropriate example.

A more suitable example is probably AIO, where the kernel pins down some
memory and we limit that amount by an upper limit on the number of AIO
requests.
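
The accounting pattern alluded to here — a system-wide cap on objects whose memory the kernel pins, as `fs.aio-max-nr` does for AIO contexts — can be sketched as follows. The cap value is an arbitrary example, not any real default.

```python
# Sketch of a system-wide cap on pinned objects, in the spirit of the
# AIO limit mentioned above.  The cap of 4 is an arbitrary example.
class PinnedPool:
    def __init__(self, max_nr):
        self.max_nr = max_nr   # system-wide ceiling on reservations
        self.in_use = 0

    def get(self, nr=1):
        """Reserve nr slots; refuse (think -EAGAIN) once the cap is hit."""
        if self.in_use + nr > self.max_nr:
            return False
        self.in_use += nr
        return True

    def put(self, nr=1):
        """Release previously reserved slots."""
        self.in_use -= nr

pool = PinnedPool(max_nr=4)
assert all(pool.get() for _ in range(4))
assert not pool.get()   # fifth reservation is refused by the cap
pool.put()
assert pool.get()       # freed capacity can be reserved again
```

The point of such a scheme is that no user, privileged or not, can grow the total pinned memory past the cap, regardless of how many contexts (or cgroups) they create.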

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 15:56                           ` Vivek Goyal
@ 2012-04-27 16:20                               ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 16:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

Hello,

On Fri, Apr 27, 2012 at 11:56:12AM -0400, Vivek Goyal wrote:
> > I find allowing unpriv users to create cgroups dumb.  A cgroup consumes
> > kernel memory.  Without kmemcg, what prevents them from creating a
> > gazillion cgroups and consuming all memory?  The idea of exposing
> > cgroups to !priv users is just broken from the get-go.
> 
> Well, creating a task consumes memory too, but we allow unpriv users to
> create tasks. :-)

We have ulimit.

> Maybe a system-wide cgroup limit would make sense?

IMHO, this was one of the larger mistakes cgroup has made.  There are
two ways to build an interface for admin stuff like this: you can
either implement and expose the core functionality and let userland
deal with distribution, or build things such that the kernel can fully
virtualize and distribute the control to each process.  Both
approaches have their pros and cons, but I generally think it's better
to go for the former for new and extra stuff like cgroup, as it is much
simpler and tends to be more flexible and adapts better as use cases
develop.

The problem with cgroup is that it's neither the former nor the latter.
It's caught somewhere in the middle with its pants down, doing a
half-assed job of providing an interface which looks like it could be
made directly accessible to !priv processes while not really being able
to handle such usage.

I mean, just think about the case you just raised.  Forget about
memory usage.  What about weights?  If you allow a random user to
create an arbitrary number of blkcg groups, [s]he gets 500 extra weight
with each blkcg!  Yeah!
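
The dilution effect can be made concrete with a small proportional-share calculation. It uses the default group weight of 500 mentioned above; the two-user setup is a made-up example.

```python
# How proportional weights dilute other users when one user creates many
# groups.  Assumes every group gets the default blkcg weight of 500 and
# that all groups are backlogged; the scenario itself is hypothetical.
DEFAULT_WEIGHT = 500

def share(own_groups, other_groups):
    """Fraction of disk time a user's groups get in aggregate."""
    total = (own_groups + other_groups) * DEFAULT_WEIGHT
    return own_groups * DEFAULT_WEIGHT / total

print(f"{share(1, 1):.2f}")    # two users, one group each -> 0.50
print(f"{share(1, 99):.2f}")   # other user spawned 99 groups -> 0.01
```

With equal per-group weights, a user who spawns 99 groups grabs 99% of the disk time, squeezing the honest single-group user down to 1% — exactly the abuse being pointed out.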

If we support full hierarchy on all controllers, exposing cgroups
directly to !priv users may start to make more sense, but I'd much
prefer having resource policy controlled and administered centrally in
userland.  It's a job much better suited to userland.  If such a
mechanism would require certain features, sure, we can accommodate that,
but I think trying to give !priv users direct access to cgroup is stupid,
especially at this point, so let's just drop it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]                               ` <20120427162012.GP27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 17:21                                 ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 17:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	Daniel P. Berrange, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 09:20:12AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Fri, Apr 27, 2012 at 11:56:12AM -0400, Vivek Goyal wrote:
> > > I find allowing unpriv users to create cgroups dumb.  cgroup consumes
> > > kernel memory.  Sans using kmemcg, what prevents them from creating
> > > a gazillion cgroups and consuming all memory?  The idea of allowing
> > > cgroups to !priv users is just broken from the get-go.
> > 
> > Well creating a task consumes memory too but we allow unpriv users to
> > create tasks. :-)
> 
> We have ulimit.
> 
> > Maybe a system-wide cgroup limit will make sense?
> 
> IMHO, this was one of the larger mistakes cgroup has made.  There are
> two ways to build an interface for admin stuff like this: you can
> either implement and expose the core functionality and let userland
> deal with distribution, or build things such that the kernel can fully
> virtualize and distribute the control to each process.  Both
> approaches have their pros and cons, but I generally think it's better
> to go for the latter for new and extra stuff like cgroup, as it is
> much simpler and tends to be more flexible and adapts better as use
> cases develop.
> 
> The problem with cgroup is that it's neither the former nor the latter.
> It's caught somewhere in the middle, doing a half-assed job of
> providing an interface which looks like it could be made directly
> accessible from !priv processes while not really being able to handle
> such usage.
> 
> I mean, just think about the case you just raised.  Forget about
> memory usage.  What about weights?  If you allow a random user to
> create an arbitrary number of blkcg groups, [s]he gets 500 extra weight
> with each blkcg!  Yeah!

This is a concern only with a flat hierarchy.  With full hierarchical
support it becomes a non-issue, as with the cpu controller.

> 
> If we support full hierarchy on all controllers, exposing cgroups
> directly to !priv users may start to make more sense but I'd much
> prefer having resource policy controlled and administered centrally in
> userland.  It's a job much better suited for userland.  If such a
> mechanism would require certain features, sure, we can accommodate that,
> but I think trying to give !priv users direct access to cgroup is
> stupid, especially at this point, so let's just drop it.

For non-privileged users, something along the lines of the per-session
cpu autogroup might make sense.  But even then, if some IO is submitted
from that auto blkgroup, the kernel can't reclaim that memory till the
IO is completed.

So the per-cgroup number of requests will probably be a problem even if
the kernel managed those groups completely.

So are you planning to put a patch in the kernel to disallow cgroup
creation for non-privileged users?

I am CCing Daniel Berrange (libvirt), who creates cgroups for virtual
machines and containers, just in case he is relying on creating cgroups
in unprivileged mode.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]                                 ` <20120427172110.GM10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 17:25                                   ` Tejun Heo
  0 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 17:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	Daniel P. Berrange, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jeff Moyer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

Hello,

On Fri, Apr 27, 2012 at 01:21:10PM -0400, Vivek Goyal wrote:
> For non-privileged users, something along the lines of the per-session
> cpu autogroup might make sense.  But even then, if some IO is submitted
> from that auto blkgroup, the kernel can't reclaim that memory till the
> IO is completed.
> 
> So the per-cgroup number of requests will probably be a problem even if
> the kernel managed those groups completely.

My point was that trying to solve all the policy decisions in kernel
proper is not a very good idea.

> So are you planning to put a patch in the kernel to disallow cgroup
> creation for non-privileged users?

No, I'm not gonna break the current users.  It's just not the
direction I want to take cgroup towards and the current breakages will
remain broken.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-26 21:59     ` Tejun Heo
@ 2012-04-27 19:46         ` Vivek Goyal
  -1 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 19:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Thu, Apr 26, 2012 at 02:59:21PM -0700, Tejun Heo wrote:

[..]
> @@ -926,6 +936,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
>  		goto fail_alloc;
>  
>  	blk_rq_init(q, rq);
> +	blk_rq_set_rl(rq, rl);

Given that we have established the rq and blkg relation at the time of
allocation, should we modify CFQ to just use that relation instead of
trying to look up the group again based on the bio?

We avoid one lookup, and we also avoid duplicate creation of a blkg in
the following corner case of bio==NULL:

	- blkg_get_rl()
	- request allocation fails. sleep, drop queue lock
	- process is moved to a different cgroup. original cgroup is
	  deleted. pre_destroy will clean up all blkgs on the blkcg.
	- process wakes up, request is allocated, set_request sets up a new
	  blkg based on the new cgroup. Now a request is queued in one
	  blkg/cgroup but it has come out of the quota of the other
	  blkg/cgroup.

Well, I have a question. Ideally nobody should be linking any more
blkgs to a blkcg once blkg_pre_destroy() has been called. But can it
happen that bio_associate_current() takes a reference to the blkcg and
the bio is throttled, then the cgroup associated with the bio is
deleted, resulting in pre_destroy()? Now the bio is submitted to CFQ,
which will try to create a new blkg for the blkcg-queue pair, and once
the IO is complete, the bio will drop the blkcg reference, which in
turn will free up the blkcg while the associated blkg is still around
and will not be cleaned up.

IOW, it looks like we need a mechanism to mark a blkcg dead (set in the
pre_destroy() call), after which any submission to the blkcg should
result in the bio being diverted to the root group?

If we reuse rl->blkg during CFQ submission, we will avoid that problem,
as the blkg the IO is being submitted to has already been disconnected
from the blkcg list; hopefully that's not a problem and the IO can
still be submitted on this blkg.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
  2012-04-27 19:46         ` Vivek Goyal
@ 2012-04-27 20:15             ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 20:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

Hello, Vivek.

On Fri, Apr 27, 2012 at 03:46:54PM -0400, Vivek Goyal wrote:
> On Thu, Apr 26, 2012 at 02:59:21PM -0700, Tejun Heo wrote:
> 
> [..]
> > @@ -926,6 +936,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> >  		goto fail_alloc;
> >  
> >  	blk_rq_init(q, rq);
> > +	blk_rq_set_rl(rq, rl);
> 
> Given that we have established the rq and blkg relation at the time of
> allocation, should we modify CFQ to just use that relation instead of
> trying to look up the group again based on the bio?

Maybe, although given the lookup cache it shouldn't really matter.

> We avoid one lookup, and we also avoid duplicate creation of a blkg in
> the following corner case of bio==NULL:
> 
> 	- blkg_get_rl()
> 	- request allocation fails. sleep, drop queue lock
> 	- process is moved to a different cgroup. original cgroup is
> 	  deleted. pre_destroy will clean up all blkgs on the blkcg.
> 	- process wakes up, request is allocated, set_request sets up a new
> 	  blkg based on the new cgroup. Now a request is queued in one
> 	  blkg/cgroup but it has come out of the quota of the other
> 	  blkg/cgroup.

I don't think it really matters as long as the request gets freed to
the right queue on completion.

> Well, I have a question. Ideally nobody should be linking any more
> blkgs to a blkcg once blkg_pre_destroy() has been called. But can it
> happen that bio_associate_current() takes a reference to the blkcg and
> the bio is throttled, then the cgroup associated with the bio is
> deleted, resulting in pre_destroy()? Now the bio is submitted to CFQ,
> which will try to create a new blkg for the blkcg-queue pair, and once
> the IO is complete, the bio will drop the blkcg reference, which in
> turn will free up the blkcg while the associated blkg is still around
> and will not be cleaned up.
> 
> IOW, it looks like we need a mechanism to mark a blkcg dead (set in the
> pre_destroy() call), after which any submission to the blkcg should
> result in the bio being diverted to the root group?

Don't we already have that with css_tryget()?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/11] blkcg: implement per-blkg request allocation
       [not found]             ` <20120427201516.GJ26595-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-04-27 20:21               ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2012-04-27 20:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w

On Fri, Apr 27, 2012 at 01:15:16PM -0700, Tejun Heo wrote:

[..]
> > IOW, it looks like we need a mechanism to mark a blkcg dead (set in the
> > pre_destroy() call), after which any submission to the blkcg should
> > result in the bio being diverted to the root group?
> 
> Don't we already have that with css_tryget()?

Ok, I forgot about css_tryget(). Yes, that should work. Thanks. The
introduction of cgroup has made the locking and lifetime rules a little
complicated. :-)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH UPDATED 02/11] blkcg: __blkg_lookup_create() doesn't need radix preload
       [not found]     ` <1335477561-11131-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2012-04-27 14:42         ` Vivek Goyal
@ 2012-04-27 21:18       ` Tejun Heo
  1 sibling, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 21:18 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

There's no point in calling radix_tree_preload() if the preload doesn't
use a more permissible GFP mask than the insertion itself.  Drop
preloading from __blkg_lookup_create().

While at it, drop sparse locking annotation which no longer applies.

v2: Vivek pointed out the odd preload usage.  Instead of updating,
    just drop it.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
We don't need preloading at all.  Just drop it.  git branch updated
accordingly.  All patches apply as-is.

Thanks.

 block/blk-cgroup.c |   10 +---------
 1 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4ab7420..af61db0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -177,7 +177,6 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
 
 static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 					     struct request_queue *q)
-	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -203,10 +202,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 		goto err_put;
 
 	/* insert */
-	ret = radix_tree_preload(GFP_ATOMIC);
-	if (ret)
-		goto err_free;
-
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
@@ -215,14 +210,11 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	}
 	spin_unlock(&blkcg->lock);
 
-	radix_tree_preload_end();
-
 	if (!ret)
 		return blkg;
-err_free:
-	blkg_free(blkg);
 err_put:
 	css_put(&blkcg->css);
+	blkg_free(blkg);
 	return ERR_PTR(ret);
 }
 
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH UPDATED 02/11] blkcg: __blkg_lookup_create() doesn't need radix preload
       [not found]     ` <1335477561-11131-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2012-04-27 21:18       ` Tejun Heo
  2012-04-27 21:18       ` [PATCH UPDATED 02/11] blkcg: __blkg_lookup_create() doesn't need radix preload Tejun Heo
  1 sibling, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 21:18 UTC (permalink / raw)
  To: axboe
  Cc: vgoyal, ctalbott, rni, linux-kernel, cgroups, containers,
	fengguang.wu, hughd, akpm

There's no point in calling radix_tree_preload() if the preload doesn't
use a more permissive GFP mask than the insertion itself.  Drop
preloading from __blkg_lookup_create().

While at it, drop sparse locking annotation which no longer applies.

v2: Vivek pointed out the odd preload usage.  Instead of updating,
    just drop it.
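
The control flow this leaves behind - allocate, attempt the insert, and
on failure free the node and hand back the error - can be sketched as a
small userspace model.  The names below (struct node, tree_insert,
lookup_create) are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for a radix tree slot keyed by a single id. */
struct node { int key; };

/* Stand-in for radix_tree_insert(): fails if the slot is occupied. */
static int tree_insert(struct node **slot, struct node *n)
{
	if (*slot)
		return -EEXIST;
	*slot = n;
	return 0;
}

/*
 * Stand-in for the simplified __blkg_lookup_create() tail: with no
 * preload to undo, there is a single error path that frees the node
 * and returns the error.
 */
static int lookup_create(struct node **slot, struct node *n)
{
	int ret = tree_insert(slot, n);

	if (ret) {
		free(n);	/* one error path, no preload_end() pairing */
		return ret;
	}
	return 0;
}
```

With the preload gone there is exactly one failure path, which is why
the err_free label disappears and blkg_free() moves under err_put in
the diff below.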

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
We don't need preloading at all.  Just drop it.  git branch updated
accordingly.  All patches apply as-is.

Thanks.

 block/blk-cgroup.c |   10 +---------
 1 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4ab7420..af61db0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -177,7 +177,6 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
 
 static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 					     struct request_queue *q)
-	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -203,10 +202,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 		goto err_put;
 
 	/* insert */
-	ret = radix_tree_preload(GFP_ATOMIC);
-	if (ret)
-		goto err_free;
-
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
@@ -215,14 +210,11 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	}
 	spin_unlock(&blkcg->lock);
 
-	radix_tree_preload_end();
-
 	if (!ret)
 		return blkg;
-err_free:
-	blkg_free(blkg);
 err_put:
 	css_put(&blkcg->css);
+	blkg_free(blkg);
 	return ERR_PTR(ret);
 }
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH UPDATED 03/11] blkcg: make root blkcg allocation use %GFP_KERNEL
  2012-04-26 21:59     ` Tejun Heo
@ 2012-04-27 21:19         ` Tejun Heo
  -1 siblings, 0 replies; 77+ messages in thread
From: Tejun Heo @ 2012-04-27 21:19 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation.  This could make
policy activation fail unnecessarily.

Make blkg_alloc() take @gfp_mask, make __blkg_lookup_create() take an
optional @new_blkg for a preallocated blkg, and have
blkcg_activate_policy() preload the radix tree and preallocate the
blkg with %GFP_KERNEL before trying to create the root blkg.

v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
    instead of an ERR_PTR() value.  Fixed.
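
The allocation pattern here - preallocate in a context where %GFP_KERNEL
is safe, then pass the result into a function that would otherwise fall
back to an atomic allocation and that consumes the argument either way -
can be sketched with a userspace stand-in (all names below are
illustrative, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a blkg hung off a single lookup slot. */
struct node { int key; };

/*
 * Stand-in for __blkg_lookup_create() with an optional preallocation.
 * If @new_node is NULL, allocate here (the atomic-context fallback);
 * @new_node is always consumed: installed on success, freed otherwise.
 */
static struct node *lookup_create(struct node **slot, struct node *new_node)
{
	struct node *n = new_node;

	if (*slot) {			/* already exists: drop preallocation */
		free(new_node);
		return *slot;
	}
	if (!n) {
		n = malloc(sizeof(*n));	/* stands in for a GFP_ATOMIC alloc */
		if (!n)
			return NULL;
	}
	*slot = n;
	return n;
}
```

The "always consumed" contract is what lets blkcg_activate_policy()
allocate with %GFP_KERNEL up front and hand the blkg in without having
to track whether the callee used it.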

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
Failure path bug fixed.  git branch updated accordingly.  All other patches
apply as-is.

Thanks.

 block/blk-cgroup.c |   59 +++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index af61db0..cbeeb54 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -91,16 +91,18 @@ static void blkg_free(struct blkcg_gq *blkg)
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
  * @q: request_queue the new blkg is associated with
+ * @gfp_mask: allocation mask to use
  *
  * Allocate a new blkg assocating @blkcg and @q.
  */
-static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
+static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
+				   gfp_t gfp_mask)
 {
 	struct blkcg_gq *blkg;
 	int i;
 
 	/* alloc and init base part */
-	blkg = kzalloc_node(sizeof(*blkg), GFP_ATOMIC, q->node);
+	blkg = kzalloc_node(sizeof(*blkg), gfp_mask, q->node);
 	if (!blkg)
 		return NULL;
 
@@ -117,7 +119,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q)
 			continue;
 
 		/* alloc per-policy data and attach it to blkg */
-		pd = kzalloc_node(pol->pd_size, GFP_ATOMIC, q->node);
+		pd = kzalloc_node(pol->pd_size, gfp_mask, q->node);
 		if (!pd) {
 			blkg_free(blkg);
 			return NULL;
@@ -175,8 +177,13 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
 
+/*
+ * If @new_blkg is %NULL, this function tries to allocate a new one as
+ * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
+ */
 static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q)
+					     struct request_queue *q,
+					     struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -188,18 +195,24 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	blkg = __blkg_lookup(blkcg, q);
 	if (blkg) {
 		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		return blkg;
+		goto out_free;
 	}
 
 	/* blkg holds a reference to blkcg */
-	if (!css_tryget(&blkcg->css))
-		return ERR_PTR(-EINVAL);
+	if (!css_tryget(&blkcg->css)) {
+		blkg = ERR_PTR(-EINVAL);
+		goto out_free;
+	}
 
 	/* allocate */
-	ret = -ENOMEM;
-	blkg = blkg_alloc(blkcg, q);
-	if (unlikely(!blkg))
-		goto err_put;
+	if (!new_blkg) {
+		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
+		if (unlikely(!new_blkg)) {
+			blkg = ERR_PTR(-ENOMEM);
+			goto out_put;
+		}
+	}
+	blkg = new_blkg;
 
 	/* insert */
 	spin_lock(&blkcg->lock);
@@ -212,10 +225,13 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 
 	if (!ret)
 		return blkg;
-err_put:
+
+	blkg = ERR_PTR(ret);
+out_put:
 	css_put(&blkcg->css);
-	blkg_free(blkg);
-	return ERR_PTR(ret);
+out_free:
+	blkg_free(new_blkg);
+	return blkg;
 }
 
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@@ -227,7 +243,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q);
+	return __blkg_lookup_create(blkcg, q, NULL);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -727,19 +743,30 @@ int blkcg_activate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
+	bool preloaded;
 
 	if (blkcg_policy_enabled(q, pol))
 		return 0;
 
+	/* preallocations for root blkg */
+	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!blkg)
+		return -ENOMEM;
+
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	blk_queue_bypass_start(q);
 
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q);
+	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
 	rcu_read_unlock();
 
+	if (preloaded)
+		radix_tree_preload_end();
+
 	if (IS_ERR(blkg)) {
 		ret = PTR_ERR(blkg);
 		goto out_unlock;
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 05/11] block: drop custom queue draining used by scsi_transport_{iscsi|fc}
       [not found]     ` <1335477561-11131-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2012-05-02  4:55       ` Mike Christie
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2012-05-02  4:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, James Smart,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hughd-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA

On 04/26/2012 04:59 PM, Tejun Heo wrote:
> iscsi_remove_host() uses bsg_remove_queue() which implements custom
> queue draining.  fc_bsg_remove() open-codes mostly identical logic.
> 
> The draining logic isn't correct in that blk_stop_queue() doesn't
> prevent new requests from being queued - it just stops processing, so
> nothing prevents new requests from being queued after the logic
> determines that the queue is drained.
> 
> blk_cleanup_queue() now implements proper queue draining and these
> custom draining logics aren't necessary.  Drop them and use
> bsg_unregister_queue() + blk_cleanup_queue() instead.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>
> Cc: Mike Christie <michaelc-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>
> Cc: James Smart <james.smart-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
> ---
>  block/bsg-lib.c                     |   53 -----------------------------------
>  drivers/scsi/scsi_transport_fc.c    |   38 -------------------------
>  drivers/scsi/scsi_transport_iscsi.c |    2 +-

iSCSI changes worked ok for me. I replicated the problem that the old
code was supposed to fix and verified the new code worked ok. I also
tested the FC code briefly (I do not have a proper setup to really
stress it), and it worked ok.

Thanks for killing that code for us.

Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
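
The race described in the quoted changelog - stopping processing without
blocking submission, then treating an empty queue as drained - can be
modeled with a toy queue (illustrative names only, not the block-layer
API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy queue: "stopped" only pauses processing, as blk_stop_queue() did. */
struct queue {
	int pending;
	bool stopped;
};

/* Submission path: note that it never consults ->stopped. */
static bool try_enqueue(struct queue *q)
{
	q->pending++;
	return true;
}

/* Processing path: halted while ->stopped is set. */
static bool process_one(struct queue *q)
{
	if (q->stopped || !q->pending)
		return false;
	q->pending--;
	return true;
}
```

Because try_enqueue() never checks ->stopped, an observer that sees
pending == 0 after stopping the queue can still be beaten by a late
submission - which is why replacing the custom logic with
blk_cleanup_queue()'s real draining closes the hole.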

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2012-05-02  4:59 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-26 21:59 [PATCHSET] block: implement per-blkg request allocation Tejun Heo
2012-04-26 21:59 ` Tejun Heo
     [not found] ` <1335477561-11131-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-04-26 21:59   ` [PATCH 01/11] blkcg: fix blkg_alloc() failure path Tejun Heo
2012-04-26 21:59     ` Tejun Heo
     [not found]     ` <1335477561-11131-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-04-27 14:26       ` Vivek Goyal
2012-04-27 14:26     ` Vivek Goyal
2012-04-27 14:26       ` Vivek Goyal
2012-04-27 14:27       ` Tejun Heo
2012-04-27 14:27         ` Tejun Heo
     [not found]       ` <20120427142652.GH10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 14:27         ` Tejun Heo
2012-04-26 21:59   ` [PATCH 02/11] blkcg: __blkg_lookup_create() doesn't have to fail on radix tree preload failure Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-27 21:18     ` [PATCH UPDATED 02/11] blkcg: __blkg_lookup_create() doesn't need radix preload Tejun Heo
2012-04-27 21:18       ` Tejun Heo
     [not found]     ` <1335477561-11131-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-04-27 14:42       ` [PATCH 02/11] blkcg: __blkg_lookup_create() doesn't have to fail on radix tree preload failure Vivek Goyal
2012-04-27 14:42         ` Vivek Goyal
2012-04-27 14:47         ` Tejun Heo
2012-04-27 14:47           ` Tejun Heo
     [not found]         ` <20120427144258.GI10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 14:47           ` Tejun Heo
2012-04-27 21:18       ` [PATCH UPDATED 02/11] blkcg: __blkg_lookup_create() doesn't need radix preload Tejun Heo
2012-04-26 21:59   ` [PATCH 03/11] blkcg: make root blkcg allocation use %GFP_KERNEL Tejun Heo
2012-04-26 21:59     ` Tejun Heo
     [not found]     ` <1335477561-11131-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-04-27 21:19       ` [PATCH UPDATED " Tejun Heo
2012-04-27 21:19         ` Tejun Heo
2012-04-26 21:59   ` [PATCH 04/11] mempool: add @gfp_mask to mempool_create_node() Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-26 21:59   ` [PATCH 05/11] block: drop custom queue draining used by scsi_transport_{iscsi|fc} Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-05-02  4:55     ` Mike Christie
2012-05-02  4:55       ` Mike Christie
     [not found]     ` <1335477561-11131-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-05-02  4:55       ` Mike Christie
2012-04-26 21:59   ` [PATCH 06/11] block: refactor get_request[_wait]() Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-26 21:59   ` [PATCH 07/11] block: allocate io_context upfront Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-26 21:59   ` [PATCH 08/11] blkcg: inline bio_blkcg() and friends Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-26 21:59   ` [PATCH 09/11] block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv Tejun Heo
2012-04-26 21:59   ` [PATCH 10/11] block: prepare for multiple request_lists Tejun Heo
2012-04-26 21:59     ` Tejun Heo
2012-04-26 21:59   ` [PATCH 11/11] blkcg: implement per-blkg request allocation Tejun Heo
2012-04-26 21:59     ` Tejun Heo
     [not found]     ` <1335477561-11131-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2012-04-27 14:54       ` Jeff Moyer
2012-04-27 14:54         ` Jeff Moyer
     [not found]         ` <x49wr51usxi.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2012-04-27 15:02           ` Tejun Heo
2012-04-27 15:02         ` Tejun Heo
2012-04-27 15:02           ` Tejun Heo
     [not found]           ` <20120427150217.GK27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-27 15:40             ` Vivek Goyal
2012-04-27 15:40               ` Vivek Goyal
2012-04-27 15:45               ` Tejun Heo
2012-04-27 15:45                 ` Tejun Heo
2012-04-27 15:48                 ` Vivek Goyal
2012-04-27 15:48                   ` Vivek Goyal
     [not found]                   ` <20120427154841.GA16237-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 15:51                     ` Tejun Heo
2012-04-27 15:51                       ` Tejun Heo
     [not found]                       ` <20120427155140.GN27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-27 15:56                         ` Vivek Goyal
2012-04-27 15:56                           ` Vivek Goyal
     [not found]                           ` <20120427155612.GK10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 16:19                             ` Vivek Goyal
2012-04-27 16:19                               ` Vivek Goyal
2012-04-27 16:20                             ` Tejun Heo
2012-04-27 16:20                               ` Tejun Heo
2012-04-27 17:21                               ` Vivek Goyal
2012-04-27 17:21                                 ` Vivek Goyal
2012-04-27 17:25                                 ` Tejun Heo
2012-04-27 17:25                                   ` Tejun Heo
     [not found]                                 ` <20120427172110.GM10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 17:25                                   ` Tejun Heo
     [not found]                               ` <20120427162012.GP27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-27 17:21                                 ` Vivek Goyal
     [not found]                 ` <20120427154502.GM27486-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-27 15:48                   ` Vivek Goyal
     [not found]               ` <20120427154033.GJ10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 15:45                 ` Tejun Heo
2012-04-27 19:46       ` Vivek Goyal
2012-04-27 19:46         ` Vivek Goyal
     [not found]         ` <20120427194654.GN10579-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-04-27 20:15           ` Tejun Heo
2012-04-27 20:15             ` Tejun Heo
     [not found]             ` <20120427201516.GJ26595-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-04-27 20:21               ` Vivek Goyal
2012-04-27 20:21             ` Vivek Goyal
2012-04-27 20:21               ` Vivek Goyal
2012-04-26 21:59 ` [PATCH 09/11] block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv Tejun Heo
